Blog 0 -- Visualizing the Palmer Penguins dataset

This post demonstrates basic data manipulation with pandas and visualizations with the plotly package.

First, let’s import the requisite packages and the Palmer Penguins data.

import pandas as pd
from plotly import express as px

# read the palmer penguins dataset
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)

Subsetting with pandas

Suppose we’re interested in how the body mass of Gentoo penguins differ by sex. By setting conditions within square brackets [], we can select a subset of the data frame.

# select rows where the value of "Species" is "Gentoo penguin (Pygoscelis papua)"
gentoos = penguins[penguins["Species"]=="Gentoo penguin (Pygoscelis papua)"]

# only keep rows with "Sex"=="MALE" or "FEMALE", to remove irregular/empty values
gentoos = gentoos[(gentoos["Sex"]=="MALE") | (gentoos["Sex"]=="FEMALE")]

Histogram with plotly

There are many ways to visualize this data. One way is to use a histogram using plotly’s histogram function.


fig = px.histogram(gentoos,
                   x = "Body Mass (g)",
                   color = "Sex",
                   barmode='overlay',
                   opacity=0.75,
                   title="Body mass of male and female Gentoo penguins",
                   color_discrete_sequence=["red", "blue"]
                  )

# display the figure
fig.show()

Here, we make use of the options in the px.histogram command to customize our visualization. For example, the options barmode='overlay' and opacity=0.75 are used to visualize the overlap between the groups.

Written on January 14, 2022