Hello and welcome to my data science project about penguins!
All relevant files and figures in this notebook (i.e. the .ipynb and .html file for this notebook, as well as the graphs) can be found in my public github repo, linked here: https://github.com/dliao1/dliao1.github.io
Onto the project!
In today's constantly evolving society, climate change is becoming a growing problem. As carbon dioxide levels rise, so does the rate at which coral reefs die off the oceans. Society continues to grow and produce, sometimes at the cost of the wildlife habitats. Greenhouse gas emmissions and pollution continue to rise with the advent of cars and factories, and global temperatures have also reflected this trend. Many people have also turned their attention to the Earth's North and Southern poles. Every year the Earth's natural glaciers melt at an alarming rate, which further results in the loss of habitat for animals that inhabit these environments.
One such animal consists of penguins. Antarctica is home to many penguin species, such as emperor penguins (as seen in Happy Feet!), king penguins, and macaroni penguins, among others. Although these penguins' habitats have become endangered, conservation efforts have been made to protect the wildlife in Antarctica, such as giving endangered animals homes in zoos and aquariums where they can be studied and nurtured.
One such dataset that exists is called Palmer's Penguins, which contains valuable data of approximately 344 penguins. In this dataset, a number of observations have been recorded about these penguins, such as their island of origin, species, bill length and depth, body mass in grams, and gender. This exploration aims to analyze these differences between three species of penguins: Adelie, Gentoo, and Chinstrap. In it, we will explore how bill lengths, bill depths, flipper lengths, and body mass differ between different penguin species and different genders. Furthermore, we will take a machine learning approach and attempt to use decision trees, k-NN, and random forest classification methods to attempt to classify penguins into different species based upon their demographic attributes, as listed earlier.
This project's ultimate goal to analyze these differences between penguin species and see if machine learning can be applied in this context to identify penguin species' based on their characteristics, which could ultimately lead to more effective penguin identification and conservation in the future. Furthermore, since this dataset includes quantitave measurements like a penguin's bill length/depth, flipper length, and body mass, analyzing these characteristics could lead to more insight into how each penguin's attributes could affect susceptibility to certain penguin-related diseases (E. coli, Avian viruses, etc.), as well how conservations and zoologists could better tailor medical devices to different penguin species.
Here we import all the modules needed for analysis of the data:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
import seaborn as sns
from sklearn.datasets import load_boston
from sklearn.datasets import load_iris
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from IPython.display import Image
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.regressionplots import abline_plot
from patsy import dmatrices, dmatrix, demo_data, ContrastMatrix, Poly
Here we read in the data into a dataframe using pandas. The dataset's source is from Kaggle, linked here: https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data?select=penguins_size.csv
As an overview of the variables contained in the dataset, they are as follows:
df = pd.read_csv("penguins_size.csv")
df.head()
| species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex | |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
Next, we need to clean the data! Since this dataset is already from Kaggle, most of the work has already been done for us. However, some of the column names aren't exactly intuitive and we'll be renaming those (i.e. culmen_length instead of bill length). Furthermore, there are some missing values in the dataset, so all rows with missing values will be dropped. To make it easier to create linear regression equations involving a penguin's sex, we will be converted the Sex column to numbers: 0 will represent a male, and 1 will represent a female. We'll also drop the island column since we don't plan on using that in our analysis.
# Renaming the columns to more intuitive names
df.rename(columns = {"species" : "Species", "island" : "Island", "culmen_length_mm": "Bill length (mm)",
"culmen_depth_mm" : "Bill depth (mm)", "flipper_length_mm" : "Flipper length (mm)",
"body_mass_g" : "Body mass (grams)", "sex" : "Sex"}, inplace = True)
# Drops all columns with NaN values
df.dropna(inplace = True)
# Ensures all the inputs in the gender columns are valid
df = df[df["Sex"].str.contains("MALE") | df["Sex"].str.contains("FEMALE")]
# Changes values in the gender column to 0 if male and 1 if female
df["Sex"].replace(to_replace={"FEMALE" : 1, "MALE" : 0}, inplace=True)
#Drops the island column since that will not be used in analysis
df.drop(columns = ["Island"], inplace = True)
# Resets indices
df.reset_index(drop = True, inplace = True)
# Ensures all the measurements for bill length/depth, flipper length, and body mass are floats
for i in range(0, len(df)):
df.at[i, "Bill length (mm)"] = float(df.at[i, "Bill length (mm)"])
df.at[i, "Bill depth (mm)"] = float(df.at[i, "Bill depth (mm)"])
df.at[i, "Flipper length (mm)"] = float(df.at[i, "Flipper length (mm)"])
df.at[i, "Body mass (grams)"] = float(df.at[i, "Body mass (grams)"])
df.head()
| Species | Bill length (mm) | Bill depth (mm) | Flipper length (mm) | Body mass (grams) | Sex | |
|---|---|---|---|---|---|---|
| 0 | Adelie | 39.1 | 18.7 | 181.0 | 3750.0 | 0 |
| 1 | Adelie | 39.5 | 17.4 | 186.0 | 3800.0 | 1 |
| 2 | Adelie | 40.3 | 18.0 | 195.0 | 3250.0 | 1 |
| 3 | Adelie | 36.7 | 19.3 | 193.0 | 3450.0 | 1 |
| 4 | Adelie | 39.3 | 20.6 | 190.0 | 3650.0 | 0 |
Now that the data's been cleaned sufficiently, we can move onto analysis!
For this data analysis, first we are interested in grouping each penguin by its species and then creating histograms of the penguins' different attributes, split by gender.
First, let's separate the data into male and female categories, based upon their species. We can make four overlapped histograms per species and see if there's an overall trend in the bill lengths, bill depths, flipper lengths, and body mass of penguins. The male distribution will be plotted in blue and the female distribution will be plotted in pink.
First, let's start with Adelie penguins:
# Isolating Adelie Penguin Data
adelie_data = df[df["Species"].str.contains("Adelie")]
adelie_data.reset_index(inplace = True, drop = True)
#Isolating into male versus female adelie penguins
male_data_adelie = adelie_data[adelie_data["Sex"] == 0]
female_data_adelie = adelie_data[adelie_data["Sex"] == 1]
# Plots the data in overlapping histograms
fig, ax = plt.subplots(4,1, figsize = (15,20))
ax[0].hist([male_data_adelie["Bill length (mm)"], female_data_adelie["Bill length (mm)"]], color= ["blue", "pink"])
ax[0].set_xlabel("Bill Length (mm)")
ax[0].set_ylabel("Frequency")
ax[0].set_title("Histogram for Adelie Penguin Bill Length (mm) by gender")
ax[1].hist([male_data_adelie["Bill depth (mm)"], female_data_adelie["Bill depth (mm)"]], color= ["blue", "pink"])
ax[1].set_xlabel("Bill Depth (mm)")
ax[1].set_ylabel("Frequency")
ax[1].set_title("Histogram for Adelie Penguin Bill Depth (mm) by gender")
ax[2].hist([male_data_adelie["Flipper length (mm)"], female_data_adelie["Flipper length (mm)"]], color= ["blue", "pink"])
ax[2].set_xlabel("Flipper length (mm)")
ax[2].set_ylabel("Frequency")
ax[2].set_title("Histogram for Adelie Penguin Flipper Length (mm) by gender")
ax[3].hist([male_data_adelie["Body mass (grams)"], female_data_adelie["Body mass (grams)"]], color= ["blue", "pink"])
ax[3].set_xlabel("Body mass (grams)")
ax[3].set_ylabel("Frequency")
ax[3].set_title("Histogram for Adelie Penguin Body Mass (grams) by gender")
fig.savefig("adelie_gender_diff.png")
As suspected, it seems that the histograms for male attributes versus female attributes for Adelie penguins differ somewhat. For all eight distributions (four per gender), it seems that the data is more or less unimodal.
Looking at the first histogram comparing male and female Adelie penguin bill lengths, both distributions are unimodal, but the male distribution seems somewhat symmetric, if not slightly left skewed. In contrast, the female distribution seems more right skewed. The peak of male bill lengths is at 41 mm, while the peak of female bill lengths is at around 36 mm. The distance between these peaks suggest that on average, male Adelie penguin bill lengths tend to be longer than female bill lengths.
Similarly, for bill depths, both distributions are unimodal. However, male bill depths tend to more symmetrically distributed while female bill depths are right skewed. The female distribution is centered at 17 mm, while the male distribution is centered at around 19 mm. Similar to the trends we observed in bill length, this also suggests that male bill depth tends to be larger than female bill depths.
Concerning flipper length, it appears that the male distribution seems to be more left skewed, while the female distribution seems to be more right skewed. The most common male flipper length is 195 mm, whilst the most commonly observed female flipper length is 185 mm. Consistent with the previous observations, this suggests that male flipper length tends to be longer than female flipper lengths. It also seems that male Adelie penguins have a slightly larger range of flipper lengths when compared to their female counterparts.
Lastly, looking at overall body mass in grams, both the male and female distribution seem more or less symmetrically distributed. For the female distribution, there is an overall peak at 3500 grams, while for males, there is a peak at around 4000 grams. This also suggests that male Adelie penguins tend to have more body mass than female Adelie penguins as well.
Next, let's look at Chinstrap penguins:
# Isolating Chinstrap Penguin Data
chin_data = df[df["Species"].str.contains("Chinstrap")]
chin_data.reset_index(inplace = True, drop = True)
#Isolating into male versus female Chinstrap penguins
male_data_chin = chin_data[chin_data["Sex"] == 0]
female_data_chin = chin_data[chin_data["Sex"] == 1]
#Plotting the data
fig, ax = plt.subplots(4,1, figsize = (15,20))
ax[0].hist([male_data_chin["Bill length (mm)"], female_data_chin["Bill length (mm)"]], color= ["blue", "pink"])
ax[0].set_xlabel("Bill Length (mm)")
ax[0].set_ylabel("Frequency")
ax[0].set_title("Histogram for Chinstrap Penguin bill length (mm) by gender")
ax[1].hist([male_data_chin["Bill depth (mm)"], female_data_chin["Bill depth (mm)"]], color= ["blue", "pink"])
ax[1].set_xlabel("Bill Depth (mm)")
ax[1].set_ylabel("Frequency")
ax[1].set_title("Histogram for Chinstrap Penguin bill depth (mm) by gender")
ax[2].hist([male_data_chin["Flipper length (mm)"], female_data_chin["Flipper length (mm)"]], color= ["blue", "pink"])
ax[2].set_xlabel("Flipper length (mm)")
ax[2].set_ylabel("Frequency")
ax[2].set_title("Histogram for Chinstrap Penguin flipper length (mm) by gender")
ax[3].hist([male_data_chin["Body mass (grams)"], female_data_chin["Body mass (grams)"]], color= ["blue", "pink"])
ax[3].set_xlabel("Body mass (grams)")
ax[3].set_ylabel("Frequency")
ax[3].set_title("Histogram for Chinstrap Penguin body mass (grams) by gender")
fig.savefig("chin_gender_diff.png")
These histograms look very similar to those produced by Adelie penguins, but still differ somewhat.
Regarding Chinstrap penguin bill lengths, it looks as if there seems to be an outlier in the female data, with one female penguin having a bill length of 57.5 mm. Aside from that, both distributions look somewhat symmetric. The peak in the female distribution occurs at 47.5 mm while the peak in male bill length occurs at around 50 mm. Just like in Adelie penguins, male Chinstrap penguins seem to have longer bill lengths than female Chinstrap penguins.
For bill depths, the female distribution looks somewhat bimodal, with two peaks at 16 mm and 18 mm, respectively. On the other hand, the male distribution looks more unimodal, with a peak at 20 mm. Given these peaks, it follows that it appears that on average, male bill depths tend to be longer than female bill depths, similar to what was seen in the Adelie penguins.
For flipper length, interestingly enough, both the male and female distributions appear to have peaks at the same measurement, at around 195 mm. However, the female distribution seems to be bimodal, with peaks at 185 mm and 195 mm, while the male distrbution appears more unimodal and centered at 195 mm. Stil, although both distributions have overlapping peaks at 195 mm, the ranges for the female and male distributions still appear to differ: the female flipper lengths range from 180 to 205 mm, while male flipper lengths range from 185 to 210 mm - thus, it still seems that on average, male flipper lengths tend to be longer than those of females.
Lastly, regarding body mass, similar to flipper length distributions, the male and female peak body mass in grams appear to occur in similar places, with the most common female body mass occuring at around 3700 grams, while the most common male body mass was around 3800 grams. Both male and female distributions appears to also be unimodal, with one peak. Therefore, it also appears that male penguins tended to have larger body masses than female penguins.
Next, let's look at gender breakdowns of Gentoo penguins:
# Isolating Gentoo Penguin Data
gent_data = df[df["Species"].str.contains("Gentoo")]
gent_data.reset_index(inplace = True, drop = True)
#Isolating into male versus female Gentoo penguins
male_data_gent = gent_data[gent_data["Sex"] == 0]
female_data_gent = gent_data[gent_data["Sex"] == 1]
#Plotting data
fig, ax = plt.subplots(4,1, figsize = (15,20))
ax[0].hist([male_data_gent["Bill length (mm)"], female_data_gent["Bill length (mm)"]], color= ["blue", "pink"])
ax[0].set_xlabel("Bill Length (mm)")
ax[0].set_ylabel("Frequency")
ax[0].set_title("Histogram for Gentoo Penguin bill length (mm) by gender")
ax[1].hist([male_data_gent["Bill depth (mm)"], female_data_gent["Bill depth (mm)"]], color= ["blue", "pink"])
ax[1].set_xlabel("Bill Depth (mm)")
ax[1].set_ylabel("Frequency")
ax[1].set_title("Histogram for Gentoo Penguin bill depth (mm) by gender")
ax[2].hist([male_data_gent["Flipper length (mm)"], female_data_gent["Flipper length (mm)"]], color= ["blue", "pink"])
ax[2].set_xlabel("Flipper length (mm)")
ax[2].set_ylabel("Frequency")
ax[2].set_title("Histogram for Gentoo Penguin flipper length (mm) by gender")
ax[3].hist([male_data_gent["Body mass (grams)"], female_data_gent["Body mass (grams)"]], color= ["blue", "pink"])
ax[3].set_xlabel("Body mass (grams)")
ax[3].set_ylabel("Frequency")
ax[3].set_title("Histogram for Gentoo Penguin body mass (grams) by gender")
fig.savefig("gent_gender_diff.png")
Looking at all the graphs together, it seems that Gentoo penguins may have the most differences between the male and female penguins. As seen in Adelie and Chinstrap penguins' graphs, there was a lot of marked overlap between the ranges of the male and female penguins' attributes, but in the case of the Gentoo penguin graphs, it seems that the ranges of the male and female attributes' obersevations seem to have significantly less overlap when compared to the other two species of penguin.
For the histogram of bill lengths, it appears that the female distribution is unimodal and symmetric, peaking at 45 mm, while the male distribution appears unimodal and left skewed, with a peak at around 48 mm. It also look as though the range of male bill lengths is much larger than that of females. Thus, we can conclude that the dataset suggests that for Gentoo penguins, male bill length appears to be, on average, longer than that of the females.
For bill depth, both the male and female distributions look more or less symmetric around their peaks. However, whilst the male distrubution is unimodal around 15.7 mm, it appears that the female stribution is somewhat bimodal, with peaks around 14 and 15 mm. This difference in peaks still suggest that male bill depths are on average longer than female bill depths.
Regarding flipper length, the female distribution seems to be unimodal and symmetric, with a peak at around 212 mm, while the male distribution seems to be unimodal and centered at 220 mm. Interestingly enough, the male distribution seems to be very left skewed, with a much larger range of flipper lengths than their female counterparts. This overlapped histogram seems to suggest that on average, male Gentoo penguin flipper lengths are both longer and more varied than those of female Gentoo penguins.
Finally, for the histogram displaying male and female body mass in grams, it seems that this histogram demonstrates the least amount of overlap, with both distributions occupying distinct regions of the graph. Namely, the range of female body mass is from 4000 g to 5000 g, whilst the range of male body mass spans 4750 g to 6000 g. However, both distributions look fairly symmetrically distributed, with one peak. The peak on the female distribution occurs at around 4750 g, and the peak for males is at around 5500 g, which suggests that on average, male Gentoo penguins tend to have a larger body mass than female Gentoo penguins.
Now that we've sufficiently analyzed differences between male and female penguins within their own respective species, let's take a look at how each species differs from each other overall. We can already see some differences between histograms (i.e Adelie male penguins had a typical bill length of 41 mm, while Gentoo male penguins have a typical bill length of 48 mm), but creating violin plots comparing each attribute of penguins (bill length/depth, flipper length, body mass) between species will also help illustrate this difference more clearly.
# Creates violin plot using seaborn with x-axis as Species and y-axis as measurements (bill length/depth, etc...)
fig1, ax1 = plt.subplots(2,2, figsize=(15, 15))
# Plots the violin plots
sns.violinplot(ax = ax1[0,0],
x=df["Species"],
y =df["Bill length (mm)"]).set_title('Violin Plot of Bill Lengths (mm) by Penguin Species')
sns.violinplot(ax = ax1[0,1],
x=df["Species"],
y =df["Bill depth (mm)"]).set_title('Violin Plot of Bill Depths (mm) by Penguin Species')
sns.violinplot(ax = ax1[1,0],
x=df["Species"],
y =df["Flipper length (mm)"]).set_title('Violin Plot of Flipper Lengths (mm) by Penguin Species')
sns.violinplot(ax = ax1[1,1],
x=df["Species"],
y =df["Body mass (grams)"]).set_title('Violin Plot of Body Mass (grams) by Penguin Species')
#fig1.delaxes(ax1[1,1])
fig1.savefig("bill_lengths_violin_plot.png")
Looking at the violin plot, it appears that the Adelie penguin distribution is more unimodal and symmetric around its center, whilst the Chinstrap and Gentoo penguin distrbutions resemble bimodal distributions. Gentoo bill length distribution also looks somewhat right skewed, to a degree. These bimodal qualities could likely be attributed to the differences in center noted when constructing histograms comparing male versus female data in between each species in the previous section. Overall, it seems that Adelie penguins tended to have shorter bills than Chinstrap and Gentoo penguins, with Chinstrap penguins on average having the longest bill lengths. Furthermore, looking at the overall lengths of the violin plots, the graph seems to suggest that Chinstrap and Gentoo penguins had more variation in bill length when compared to Adelie penguins.
This graph seem to be opposite of the violin plot of bill lengths. The Adelie and Chinstrap violin plots seem to be unimodally distributed and somewhat symmetric, while the Gentoo plot seems somewhat bimodal. Furthermore, it appears that Adelie and Chinstrap penguins tended to have larger bill depths (with medians of around 18.5 mm) compared to Gentoo penguins. It also looks like Adelie penguins tended to have the most variation in bill length, being the longest violin plot plotted. It is interesting that, taking into account the bill length plots, Adelie penguins tended to have shorter bill lengths but also longer bill depths, which could suggest a negative relationship between the two. We'll go more into detail about this in the next section.
Both Adelie and Chinstrap penguins have symmetric, unimodal distributions centered from 190 - 200 mm flipper lengths. In contrast, the Gentoo penguin distribution appears to be bimodal and centered at 220 mm. This suggets that Adelie and Chinstrap penguins tended to have shorter flippers, while Gentoo penguins had longer flippers. The lengths of the violin plots also suggest more variation in Adelie and Chinstrap flipper lengths when compared to Gentoo flipper lengths.
This violin plot looks quite similar to that of flipper lengths, which suggests that there could be a possible relationship beween the two. Again, Adelie and Chinstrap distrbutions look mostly symmetric and unimodal. Penguins belonging to either species tended to have lower body masses, at around 4000 grams. On the other hand, the Gentoo penguin distribution looks bimodal. Gentoo penguins tended to have higher body masses, at around 5000 grams. On an interesting note, it seems as though the consistently bimodal distributions for Gentoo penguins present in all four of these violin plots are likely due to differents between male and female Gentoo penguins, as observed earlier.
We've already analyzed gender differences in each species and interspecies differences in attributes. It was noted that a possible positive correlation could exist between certain attributes of penguins, such as between flipper length and body mass, given the similar violin plots. Thus, we've created a heatmap displaying possible relationships variables could have with each other, generated using seaborn.
# Uses seaborn to show a heatmap of all the variables with one another
sns.heatmap(df.corr(), annot = True)
plt.show()
This heatmap gives high values for a relationship between flipper length and body mass, which confirms earlier suspicions. The heatmap also gives a high negative correlation value between bill depth and flipper length, which is also consistent with the earlier observation that the violin plots for those two attributes seem to be opposites. It is also worth noting that the Sex variable seems to have negative correlations with every noted measurement (i.e. Bill length, depth, etc.). Recall earlier that in the dataset, female penguins were noted with a 1 in the column and males were noted with a 0. Furthermore, in all of the generated gender-specific histograms, females had lower measurements across the board when compared to their male counterparts. Thus, this negative correlation could be due to a general trend where, if a 1 was encountered (i.e. a female), measurements of bill length/depth, flipper length, and body mass seemed to decrease: hence, the noted negative correlation in the heatmap.
Moving on, since the heatmap notes a high correlation value of 0.87 between flipper length and body mass, it's worth conducting a linear regression analysis to attempt to visualize the data and how it could be linearly related. Furthermore, the heatmap also noted fairly high correlation values for flipper length and bill length, and also body mass and bill length, which I've also decided to investigate.
First, let's make scatterplots of all three relationships that we've decided to focus on:
# Separates resulting figure into four partitions
fig, ax = plt.subplots(2,2, figsize = (15,15))
# Plots body mass vs. flipper length
ax[0, 0].scatter(adelie_data["Body mass (grams)"], adelie_data["Flipper length (mm)"], color = "blue")
ax[0, 0].scatter(chin_data["Body mass (grams)"], chin_data["Flipper length (mm)"], color = "green")
ax[0, 0].scatter(gent_data["Body mass (grams)"], gent_data["Flipper length (mm)"], color = "red")
ax[0, 0].set_xlabel("Body Mass (grams)")
ax[0, 0].set_ylabel("Flipper length (mm)")
ax[0, 0].set_title("Body Mass vs. Flipper Length for All Penguins")
# Plots bill length vs. flipper length
ax[1, 0].scatter(adelie_data["Bill length (mm)"], adelie_data["Flipper length (mm)"], color = "blue")
ax[1, 0].scatter(chin_data["Bill length (mm)"], chin_data["Flipper length (mm)"], color = "green")
ax[1, 0].scatter(gent_data["Bill length (mm)"], gent_data["Flipper length (mm)"], color = "red")
ax[1, 0].set_xlabel("Bill length (mm)")
ax[1, 0].set_ylabel("Flipper length (mm)")
ax[1, 0].set_title("Bill Length vs. Flipper Length for All Penguins")
#Plots body mass vs. bill length
ax[0, 1].scatter(adelie_data["Body mass (grams)"], adelie_data["Bill length (mm)"], color = "blue")
ax[0, 1].scatter(chin_data["Body mass (grams)"], chin_data["Bill length (mm)"], color = "green")
ax[0, 1].scatter(gent_data["Body mass (grams)"], gent_data["Bill length (mm)"], color = "red")
ax[0, 1].set_xlabel("Body mass (grams)")
ax[0, 1].set_ylabel("Bill length (mm)")
ax[0, 1].set_title("Body Mass vs. Bill Length for All Penguins")
fig.delaxes(ax[1,1])
fig.savefig("bm_vs_bl.png")
Looking at this graph, most of the data for all three penguins seems to follow a positive linear relationship. That is, for all three species of penguin, as body mass increases, so does flipper length. As observed earlier, there seems to be a lot of similarity in measurements for Adelie and Chinstrap data for these two particular observations. However, there seems to be a clear separation between observations for Adelie/Chinstrap penguins and Gentoo penguins.
All three species of penguin data still appear to follow a positive linear relationship. As body mass increases, bill length tends to increase. However, it appears that when compared to their Adelie and Gentoo counterparts, Chinstrap penguins at the same level of body mass tended to have higher bill lengths (shown by the green cluster placed above the blue and red clusters) Similarly, in the first graph, Adelie and Chinstrap data seemed to be interspersed with each other, but in this scatterplot, each species' data seems to occupy separate parts of the graph. When conducting a linear regression analysis, it looks like including an interaction term between species and body mass could lead to better results.
Observing overall trends, as bill length increased, flipper length also tended to increase. Similar to what was observed in the scatterplot showing body mass vs. bill length, there are three marked clusters of data, and Chinstrap penguins seem to occupy a separate area of the scatterplot entirely. Chinstrap penguins tended to have lower flipper lengths when compared to Adelie and Gentoo penguins with the same bill lengths.
results = smf.ols("Q('Flipper length (mm)') ~ Q('Species')*Q('Body mass (grams)')", data=df).fit()
print("Statsmodel Linear Regression Results: ")
print(results.summary())
Statsmodel Linear Regression Results:
OLS Regression Results
====================================================================================
Dep. Variable: Q('Flipper length (mm)') R-squared: 0.856
Model: OLS Adj. R-squared: 0.854
Method: Least Squares F-statistic: 389.9
Date: Mon, 20 Dec 2021 Prob (F-statistic): 2.08e-135
Time: 20:29:37 Log-Likelihood: -1028.1
No. Observations: 333 AIC: 2068.
Df Residuals: 327 BIC: 2091.
Df Model: 5
Covariance Type: nonrobust
====================================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------------------------
Intercept 165.6032 3.619 45.756 0.000 158.483 172.723
Q('Species')[T.Chinstrap] -14.2224 7.339 -1.938 0.053 -28.660 0.215
Q('Species')[T.Gentoo] 4.0640 6.195 0.656 0.512 -8.123 16.251
Q('Body mass (grams)') 0.0066 0.001 6.820 0.000 0.005 0.009
Q('Species')[T.Chinstrap]:Q('Body mass (grams)') 0.0053 0.002 2.704 0.007 0.001 0.009
Q('Species')[T.Gentoo]:Q('Body mass (grams)') 0.0027 0.001 1.978 0.049 1.54e-05 0.005
==============================================================================
Omnibus: 2.455 Durbin-Watson: 1.756
Prob(Omnibus): 0.293 Jarque-Bera (JB): 2.258
Skew: -0.122 Prob(JB): 0.323
Kurtosis: 3.322 Cond. No. 1.38e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.38e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
We've chosen to include an interaction term between species and body mass, and due to the nature of statsmodels, the Adelie penguin column has been dropped. This is due to the way statsmodels deals with categorical variables. However, this means that we can interpret the coefficient of body mass (0.0066) as the interaction term between Adelie penguins and body mass.
Putting it all together, when the species is Chinstrap, the flipper length increases 0.0053 times as much as compared to when the species is Adelie. Likewise, when the species is Gentoo, the flipper length increases 0.0027 times as much as compared to when the species is Adelie. The interaction term between Gentoo penguins and body mass being much smaller than that of Chinstrap penguins makes sense - as seen in the scatter plot, Gentoo data appeared to be interspersed with Adelie data, which is consistent with this smaller coefficient.
Looking at the P > t column of the statsmodel results summary, the only two coefficients that have p-values greater than an alpha of 0.05 are Q('Species')[T.Chinstrap] and Q('Species')[T.Gentoo]. However, overall, most of the coefficients in the model are signifacntly different from 0.
From the results, it appears that the R^2 value for this linear regression model is 0.856, which is close to 1 and suggests that there is evidence of a strong linear relationship between flipper length and body mass.
To confirm this, let's look at the violin plot residuals from this model:
# Gets the residuals from the statsmodel OLS
interaction_model_resids = results.resid
# Plots violin plot of residuals by species
fig, ax = plt.subplots()
sns.violinplot(x=df["Species"],
y =interaction_model_resids).set_title('Residuals of Linear Regression between Body Mass and Flipper Length')
Text(0.5, 1.0, 'Residuals of Linear Regression between Body Mass and Flipper Length')
This violin plot of residuals by penguin species matches assumptions of the linear regression model well. Each of the violin plots is unimodal, and also clustered around the center, where the median is roughly 0. The ranges of the residuals are also quite large, as seen by the larger range of Adelie penguins. Thus, we can say that this violin plot of residuals matches assumptions of the linear regression model well.
Let's move onto our next relationship between bill length and flipper length:
results = smf.ols("Q('Flipper length (mm)') ~ Q('Species')*Q('Bill length (mm)')", data=df).fit()
print("Statsmodel Linear Regression Results: ")
print(results.summary())
Statsmodel Linear Regression Results:
OLS Regression Results
====================================================================================
Dep. Variable: Q('Flipper length (mm)') R-squared: 0.831
Model: OLS Adj. R-squared: 0.829
Method: Least Squares F-statistic: 322.5
Date: Mon, 20 Dec 2021 Prob (F-statistic): 4.74e-124
Time: 20:29:37 Log-Likelihood: -1054.8
No. Observations: 333 AIC: 2122.
Df Residuals: 327 BIC: 2144.
Df Model: 5
Covariance Type: nonrobust
===================================================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------------
Intercept 158.5047 7.038 22.520 0.000 144.659 172.351
Q('Species')[T.Chinstrap] -11.8689 12.545 -0.946 0.345 -36.548 12.810
Q('Species')[T.Gentoo] -8.2555 10.801 -0.764 0.445 -29.503 12.992
Q('Bill length (mm)') 0.8139 0.181 4.500 0.000 0.458 1.170
Q('Species')[T.Chinstrap]:Q('Bill length (mm)') 0.1934 0.279 0.694 0.488 -0.355 0.742
Q('Species')[T.Gentoo]:Q('Bill length (mm)') 0.5943 0.250 2.382 0.018 0.104 1.085
==============================================================================
Omnibus: 11.317 Durbin-Watson: 1.975
Prob(Omnibus): 0.003 Jarque-Bera (JB): 18.403
Skew: -0.203 Prob(JB): 0.000101
Kurtosis: 4.078 Cond. No. 2.32e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.32e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Like what we saw with our first linear regression, the Adelie penguin column has been dropped due to how statsmodel works. Thus we can interpret the coefficient of bill length (0.8139) as the interaction term between Adelie penguins and bill length.
Looking at all our interaction terms together, when the species is Chinstrap, the flipper length increases 0.1934 times as much as compared to when the species is Adelie. Likewise, when the species is Gentoo, the flipper length increases 0.5943 times as much as compared to when the species is Adelie.
Looking at the P > t column of the statsmodel results summary, there are three coefficients that have p-values greater than an alpha of 0.05: Q('Species')[T.Chinstrap] , Q('Species')[T.Gentoo] , and Q('Species')[T.Chinstrap]:Q('Bill length (mm)'). However, all other coefficients are significantly different from 0.
From the results, it appears that the R^2 value for this linear regression model is 0.831, which is close to 1 and suggests that there is evidence of a strong linear relationship between flipper length and bill length.
To confirm this, let's look at the violin plot residuals from this model:
# Gets the residuals from the statsmodel OLS
interaction_model_resids = results.resid
# Plots violin plot of residuals by year
fig, ax = plt.subplots()
sns.violinplot(x=df["Species"],
y =interaction_model_resids).set_title('Residuals of Linear Regression between Bill Length and Flipper Length')
Text(0.5, 1.0, 'Residuals of Linear Regression between Bill Length and Flipper Length')
Just like the first plot of residuals, this one looks similar. All of the plots are unimodal and centered around zero, affirming our previous idea that a linear regression model would be appropriate for the data.
Let's move onto our next relationship between bill length and flipper length:
results = smf.ols("Q('Bill length (mm)') ~ Q('Species')*Q('Body mass (grams)')", data=df).fit()
print("Statsmodel Linear Regression Results: ")
print(results.summary())
Statsmodel Linear Regression Results:
OLS Regression Results
=================================================================================
Dep. Variable: Q('Bill length (mm)') R-squared: 0.808
Model: OLS Adj. R-squared: 0.805
Method: Least Squares F-statistic: 275.3
Date: Mon, 20 Dec 2021 Prob (F-statistic): 7.30e-115
Time: 20:29:37 Log-Likelihood: -762.97
No. Observations: 333 AIC: 1538.
Df Residuals: 327 BIC: 1561.
Df Model: 5
Covariance Type: nonrobust
====================================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------------------------
Intercept 27.1129 1.632 16.609 0.000 23.902 30.324
Q('Species')[T.Chinstrap] 5.0613 3.310 1.529 0.127 -1.451 11.573
Q('Species')[T.Gentoo] -0.5750 2.794 -0.206 0.837 -6.072 4.922
Q('Body mass (grams)') 0.0032 0.000 7.228 0.000 0.002 0.004
Q('Species')[T.Chinstrap]:Q('Body mass (grams)') 0.0013 0.001 1.475 0.141 -0.000 0.003
Q('Species')[T.Gentoo]:Q('Body mass (grams)') 0.0010 0.001 1.558 0.120 -0.000 0.002
==============================================================================
Omnibus: 5.384 Durbin-Watson: 2.252
Prob(Omnibus): 0.068 Jarque-Bera (JB): 5.657
Skew: 0.207 Prob(JB): 0.0591
Kurtosis: 3.487 Cond. No. 1.38e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.38e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Thus we can interpret the coefficient of body mass length (0.0032) as the interaction term between Adelie penguins and bill length.
Looking at all our interaction terms together, when the species is Chinstrap, the bill length increases 0.0013 times as much as compared to when the species is Adelie. Likewise, when the species is Gentoo, the flipper length increases 0.0010 times as much as compared to when the species is Adelie.
Looking at the P > t column of the statsmodel results summary, there are many coefficients that have p-values greater than an alpha of 0.05. This could suggest that most of the data follows a general overall trend and an interaction term between Species and Body Mass wasn't necessarily needed.
The R^2 value for this linear regression model is 0.808, which is close to 1 and suggests that there is evidence of a strong linear relationship between body mass and bill length.
To confirm this, let's look at the violin plot residuals from this model:
# Gets the residuals from the statsmodel OLS
interaction_model_resids = results.resid
# Plots violin plot of residuals by year
fig, ax = plt.subplots()
sns.violinplot(x=df["Species"],
y =interaction_model_resids).set_title('Residuals of Linear Regression between Body Mass and Bill Length')
Text(0.5, 1.0, 'Residuals of Linear Regression between Body Mass and Bill Length')
All of the plots are unimodal and centered around zero, affirming our previous idea that a linear regression model would be appropriate for the data.
After including an interaction term between a penguin's species and another independent variable, it was found that the data supported evidence of a strong linear relationship between bill length vs. flipper length, body mass vs. flipper length, and body mass vs. bill length, as shown by relatively high R^2 values around 0.8 when we conducted a linear regression analysis. To that end, we can conclude that increases in bill length are correlated with increases in flipper length, and likewise similar positive correlations with the other two relationships investigated.
It is interesting that although the heatmap gave different correlation values for all three relationships, most of the R^2 values generated were around the same number. This is likely due to the inclusion of the interaction term included for Species. In the scatterplots shown earlier, some species of penguin had observations in a different part of the graph, and so by accounting for these species differences with the interaction term, the linear regression models were able to attain similar levels of accuracy.
Next, let's move onto machine learning with our dataset!
We can use powerful machine learning algorithms to analyze our penguin dataset. In particular, this project will be investigating how three machine learning algorithms for classification: k-NN, Decision Trees, and Random Forests perform on our dataset. We will be using accuracy to score the three previously identified algorithms. This is because we are particularly interested in the percentage of correct predictions of penguin species made based on our predictors.
The hyperparameters chosen to focus on were n_neighbors and weights.
First a grid search was performed on n_neighbors/weights due to the large impact both can have our model - one of the most important thing for k-Nearest Neighbors is determining the k (i.e. how many nearest neighbors there can be), and the weights parameter can influence whether points closer to the query point have less or greater influence, which would also have a large impact on the model. After performing grid search, it was determined that the k should be set to 1, and weights should be uniform.
## k-NN Classification
# Creates X by dropping the species values and aggregating the rest of the columns into arrays, one array for
# each observed penguin.
# That is, our predictors will be the other columns like bill length, bill depth, flipper length, body mass, and sex
X = df.drop(columns = "Species").values
# Creates y by aggregating all species values into an array
# That is, our targets will be the species names
y = df["Species"].values
#Creates dictionary of parameters
params = {"n_neighbors": [1,2,3,4,5,6,7,8,9,10], "weights" : ["uniform", "distance"]}
k_neighbors = KNeighborsClassifier()
grid_search = GridSearchCV(estimator=k_neighbors, param_grid=params, cv = 10, scoring = 'accuracy')
clf = grid_search.fit(X,y)
#Prints the best combination of parameters
print("k-NN best combination of parameters: ")
print(clf.best_params_)
print()
clf = KNeighborsClassifier(n_neighbors=1, weights="uniform")
scores = cross_val_score(clf, X, y, cv=10, scoring = 'accuracy')
print("k-NN Classification has %0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
k-NN best combination of parameters:
{'n_neighbors': 1, 'weights': 'uniform'}
k-NN Classification has 0.86 accuracy with a standard deviation of 0.05
After performing grid search and 10-fold cross validation, our k-NN classification has a mean 0.86 accuracy with a standard deviation of 0.05. Dividing our standard deviation by sqrt(10) to find the standard error would give a standard error of 0.0158. This means that on average, our k-NN classification algorithm correctly classified penguins into their species with an accuracy rate of 86%. This number isn't low, but let's run the rest of our classfication algorithms to get a better idea of their accuracy rates as well:
In this case, we chose to focus on criterion, splitter, max_features, and max_depth as hyperparameters.
First, criterion could be calculated with either gini or entropy. Since entropy and information gain were explored in class, we wanted to examine if the criterion for calculating how "good" a split was had a large impact on the model's accuracy, thus, we chose to include these in the dictionary of hyperparameters to see if calculating a split based off of gini would be better than using entropy, or vice versa.
The choice of splitter was also considered, as choosing the best split versus choosing the split randomly could have an effect on the accuracy of the model.
Max_depth and max_features were also focused on due to their potential for overfitting (and thus, the subsequent use of GridSearchCV to minimize overfitting).
After running a grid search on these parameters, it was determined that the entropy criterion should be used to determine the best subtrees for splitting, the best split would be chosen, sqrt(n_features) would be considered when splitting, and the max_depth would be 4.
## Decision Tree (Classification)
# Makes list of column names and target (species) names, to be used when we generate an image of the overall decision tree
col_names = df.drop(columns = "Species").columns
target_names = df["Species"].values
target_names = list(set(target_names))
# Creates dictionary of parameter values
params = {"criterion": ["gini", "entropy"], "splitter" : ["best", "random"], "max_depth" : [1,2,3,4,5,6,7,8,9,10],
"max_features" : ["auto", "sqrt", "log2"]}
decision_tree = DecisionTreeClassifier(random_state=0)
# Performs grid search
grid_search = GridSearchCV(estimator=decision_tree, param_grid=params, cv = 10, scoring = 'accuracy')
clf = grid_search.fit(X,y)
#Prints the best combination of parameters
print("Decision Tree best combination of parameters: ")
print(clf.best_params_)
print()
# Performs 10-fold cross validation with the identified best parameters
clf = DecisionTreeClassifier(criterion = "entropy", random_state=0, splitter='best', max_features = 'auto', max_depth = 4)
# Using accuracy as the scoring metric
scores = cross_val_score(clf, X, y, cv=10, scoring = 'accuracy')
clf.fit(X,y)
# Prints the mean and standard deviation accurary of decision tree classification
print("Decision Tree has %0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
fig = plt.figure(figsize=(25,20))
tree.plot_tree(clf,
feature_names=col_names,
class_names=target_names,
filled=True)
fig.savefig("decistion_tree.png")
Decision Tree best combination of parameters:
{'criterion': 'entropy', 'max_depth': 4, 'max_features': 'auto', 'splitter': 'best'}
Decision Tree has 0.95 accuracy with a standard deviation of 0.02
The decision tree has also been printed above.
After performing grid search and 10-fold cross validation, it was caclulated that the decision tree had a mean 0.95 accuracy with a standard deviation of 0.02. Dividing our standard deviation by sqrt(10) to find the standard error would give a standard error of 0.00632. This means that on average, our decision tree classification algorithm correctly classified penguins into their species with an accuracy rate of 95%. This number is quite high and close to 100%, which increases our confidence in using this decision tree to classify penguins in the future.
However, let's proceed with random forest classification to see if we can get a higher accuracy rate:
In this case, we chose to focus on criterion, max_depth, and max_features as hyperparameters, similar to what was done for our Decision Tree classfier.
First, as mentioned before, criterion was included in our dictionary of hyperparameters to see if calculating a split based off of gini would be better than using entropy, or vice versa.
Max_depth and max_features were also included in the grid search due to their potential for overfitting (as large values could lead to poorer accuracy).
Thus, after running a grid search on these parameters, it was determined that the entropy criterion should be used to determine the best subtrees for splitting, sqrt(n_features) would be considered when splitting, and the max_depth would be 7. This is interesting as the max_depth calculated for our decision tree was 4 instead, but this likely makes sense due to how decision trees work: i.e. it generates many decision trees based off randomly selected attributes and aggregates them together for a final result.
## Random Forest (Classification)
# Creates dictionary of parameters
params = {"criterion": ["gini", "entropy"], "max_depth" : [1,2,3,4,5,6,7,8],
"max_features" : ["auto", "sqrt", "log2"]}
random_forest = RandomForestClassifier(random_state=0)
# Performs grid search
grid_search = GridSearchCV(estimator=random_forest, param_grid=params, cv = 10, scoring = 'accuracy')
clf = grid_search.fit(X,y)
#Prints the best combination of parameters
print("Random Forest best combination of parameters: ")
print(clf.best_params_)
print()
# Performs 10-fold cross validation with the identified best parameters
clf = RandomForestClassifier(criterion = "entropy", n_estimators = 100, random_state=0, max_features = 'auto', max_depth = 7)
# Using accuracy as the scoring metric
scores = cross_val_score(clf, X, y, cv=10, scoring = 'accuracy')
# Prints the mean and standard deviation accurary of decision tree classification
print("Random Forest has %0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
Random Forest best combination of parameters:
{'criterion': 'entropy', 'max_depth': 7, 'max_features': 'auto'}
Random Forest has 0.98 accuracy with a standard deviation of 0.02
After performing grid search and 10-fold cross validation, it was calculated that the decision tree had a mean 0.98 accuracy with a standard deviation of 0.02. Dividing our standard deviation by sqrt(10) to find the standard error would give a standard error of 0.00632. This means that on average, our random forest classification algorithm correctly classified penguins into their species with an accuracy rate of 98%. This number is quite high and close to 100%, which increases our confidence in using a random forest to classify penguins in the future.
After running three machine learning classification algorithms on the penguin dataset, our k-NN algorithm had an accuracy rate of 0.86 with a standard error of 0.0158, the decision tree had an accuracy of 0.95 with a standard error of 0.00632, and the random forest had an accuracy of 0.98 with a standard error of 0.00632. Among all three algorithms we ran, the random forest seems to have the highest accuracy rate and lowest standard error. This higher accuracy rate is likely due to how random forests are calculated, as instead of creating one decision tree (as seen in our decision tree algorithm), multiple are created and then aggregated.
After doing our exploratory data analysis and machine learning, we can conclude quite a lot from this penguin dataset. From our intraspecies gender analysis using overlapping histograms, it was found that across all three species of penguin, females tended to have shorter bill lengths, bill depths, flipper lengths, and body mass when compared to males. Furthermore, after lookinng at the interspecies violin plots of attributes, it was noted that Adelie penguins tended to have shorter bill lengths and flipper lengths when compared to Chinstrap and Gentoo penguins. Conversely, Gentoo penguins appeared to have the longest bill lengths, flipper lengths, and body mass when compared to Adelie and Chinstrap penguins.
After generating scatterplots and conducting linear regression analysis, it was found that strong evidence of a linear relationship exists for the following relationships: bill length and flipper length, body mass and bill length, and body mass and flipper length. These linear regression lines were calculated using an interaction term between a penguin's species and the other independent variable, as it was noted on the scatterplots that Chinstrap penguins tended to have longer bill lengths at the body mass level of Adelie and Gentoo penguins.
Finally, after conducting machine learning algorithms on the data, it was found that after using k-NN, Decision Trees, and Random Forests, random forests had the highest accuracy rate, at around 0.98.
This project was truly a valuable exercise in learning how to use datasets, clean them, and analyze them for valuable solutions. It highlighted how important data science can be to all areas of life, but in this case, especially towards animal conservation efforts and animal scientists. Studying data on animals and drawing conclusions about correlations between different animal attributes could lead scientists to discover more effective ways to treat animals in cases of illness. Furthermore, such knowledge could even help aid wildlife conservation efforts in the future. For future investigations, it would be interesting to study data on penguin-related disease infection rates and attempt to draw further conclusions using even more data.
If you're interested in reading up more on topics related to this notebook, here are some resources I've found helpful: