update lecture

d3ea2ad3 · Yuxuan Mei · d1357608 · d3ea2ad3
Commit d3ea2ad3 authored 9 months ago by Yuxuan Mei
--- a/data-visualization.ipynb
+++ b/data-visualization.ipynb
@@ -52,6 +52,14 @@
    "pokemon"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "6f97d69f",
+   "metadata": {},
+   "source": [
+    "Note how this table consists of individual records of pokemons as rows. This makes it a long-form table, and is good for seaborn plotting."
+   ]
+  },
  {
   "cell_type": "markdown",
   "id": "95083b7c-e36c-4ad8-86e9-1f249d6a6113",
@@ -93,7 +101,7 @@
    "sns.scatterplot(pokemon[~pokemon[\"Legendary\"]], x=\"Attack\", y=\"Defense\", ax=ax1)\n",
    "ax2.set_title(\"Legendary\")\n",
    "sns.scatterplot(pokemon[pokemon[\"Legendary\"]], x=\"Attack\", y=\"Defense\", ax=ax2)\n",
-    "fig.show()"
+    "plt.show()"
   ]
  },
  {
@@ -216,7 +224,7 @@
   "id": "0410e17b-20a0-4d21-b187-a3597c89a642",
   "metadata": {},
   "source": [
-    "Write a `seaborn` expression to create a line plot comparing the `Year` (x-axis) to the `Life_Expectancy` (y-axis) colored with `hue=\"Country\"`."
+    "Write a `seaborn` expression to create a line plot plotting the `Life_Expectancy` (y-axis) against the `Year` (x-axis) colored with `hue=\"Country\"`."
   ]
  },
  {
@@ -227,6 +235,14 @@
   "outputs": [],
   "source": []
  },
+  {
+   "cell_type": "markdown",
+   "id": "ae16270f",
+   "metadata": {},
+   "source": [
+    "Anything noticeable? What title can we give to this plot?"
+   ]
+  },
  {
   "cell_type": "markdown",
   "id": "7429bf64-f808-44f5-b7a5-fe05cd7efbb1",
@@ -309,7 +325,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.10.13"
+   "version": "3.8.13"
  }
 },
 "nbformat": 4,

 %% Cell type:markdown id:e60a9820-3375-4c4a-acc1-485be4be01e7 tags:

 # Data Visualization

 In this lesson, we'll learn two data visualization libraries: `matplotlib` and `seaborn`. By the end of this lesson, students will be able to:

 - Skim library documentation to identify relevant examples and usage information.
 - Apply `seaborn` and `matplotlib` to create and customize relational and regression plots.
 - Describe data visualization principles as they relate to the effectiveness of a plot.

 Just like how we like to import `pandas` as `pd`, we'll import `matplotlib.pyplot` as `plt` and `seaborn` as `sns`.

 **Seaborn** is a Python data visualization library based on matplotlib. Behind the scenes, seaborn uses matplotlib to draw its plots. When importing seaborn, it is recommended to call `sns.set_theme()` to apply the recommended seaborn visual style instead of the default matplotlib theme.

 %% Cell type:code id:5872c8e4-b2e0-4237-b1f1-f64dbd05ef5a tags:

 ``` python
 import matplotlib.pyplot as plt
 import pandas as pd
 import seaborn as sns

 sns.set_theme()
 ```

 %% Cell type:markdown id:e748d5dd-da19-4d07-abda-e965f6a5979c tags:

 Let's load this uniquely-formatted pokemon dataset.

 %% Cell type:code id:21541502-fd8f-45f7-85e2-b1f9e31af340 tags:

 ``` python
 pokemon = pd.read_csv("pokemon_viz.csv", index_col="Num")
 pokemon
 ```

+%% Cell type:markdown id:6f97d69f tags:
+
+Note how this table consists of individual records of pokemons as rows. This makes it a long-form table, and is good for seaborn plotting.
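
To make the long-form idea concrete, here is a minimal sketch that reshapes a wide-form table into long-form with `DataFrame.melt`. The table and its numbers are made up for illustration and are not part of the pokemon dataset.

```python
import pandas as pd

# Made-up wide-form table: one row per year, one column per country.
wide_form = pd.DataFrame({
    "Year": [2000, 2001],
    "Country A": [70.0, 70.5],
    "Country B": [75.0, 75.2],
})

# Reshape to long-form: one row per (Year, Country) record,
# with every variable stored in its own column.
long_form = wide_form.melt(
    id_vars="Year", var_name="Country", value_name="Life_Expectancy"
)
print(long_form)
```

In the long-form version, each variable can be handed to a seaborn parameter by name, such as `x="Year"` or `hue="Country"`.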
+
 %% Cell type:markdown id:95083b7c-e36c-4ad8-86e9-1f249d6a6113 tags:

 ## Figure-level versus axes-level functions

 One way to draw a scatter plot comparing every pokemon's `Attack` and `Defense` stats is by calling `sns.scatterplot`. Because [this plotting function has so many parameters](https://seaborn.pydata.org/generated/seaborn.scatterplot.html), it's good practice to specify **keyword arguments** that tell Python which argument should go to which parameter.

 %% Cell type:code id:88545227-07e6-4dcc-8723-a2555ffc59fd tags:

 ``` python
 sns.scatterplot(pokemon, x="Attack", y="Defense")
 ```

 %% Cell type:markdown id:a13251e4-c112-40ea-a593-87123fa78876 tags:

 `sns.scatterplot` returns a matplotlib object called **axes** that can be used to compose multiple plots into a single visualization. We can show two plots side-by-side by placing each plot on its own axes within the same figure. For example, we could compare the attack and defense stats for two different groups of pokemon: not-`Legendary` and `Legendary`.
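
As a quick check of what an axes-level function hands back, the sketch below plots a small made-up table (the `toy` values are assumptions for illustration, not rows from `pokemon_viz.csv`) and inspects the result.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from matplotlib.axes import Axes

# Made-up stand-in for the pokemon table.
toy = pd.DataFrame({
    "Attack": [49, 52, 48, 110],
    "Defense": [49, 43, 65, 90],
})

# Axes-level functions draw onto a matplotlib Axes and return it.
ax = sns.scatterplot(data=toy, x="Attack", y="Defense")
print(isinstance(ax, Axes))

# Because it is a regular matplotlib Axes, matplotlib methods apply.
ax.set_title("Attack versus Defense")
```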

 %% Cell type:code id:1f1247ae-736d-4aee-be81-caf26f993780 tags:

 ``` python
 # Nested tuple unpacking!
 fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)
 ax1.set_title("Not Legendary")
 sns.scatterplot(pokemon[~pokemon["Legendary"]], x="Attack", y="Defense", ax=ax1)
 ax2.set_title("Legendary")
 sns.scatterplot(pokemon[pokemon["Legendary"]], x="Attack", y="Defense", ax=ax2)
-fig.show()
+plt.show()
 ```

 %% Cell type:markdown id:4506e513-7ec4-4ff2-a952-febac3526a98 tags:

 Each problem in the plot above can be fixed manually by repeatedly editing and running the code until you get a satisfactory result, but it's a tedious process. Seaborn was invented to make our data visualization experience less tedious. Functions like `sns.scatterplot` are considered **axes-level functions** designed for interoperability with the rest of `matplotlib`, but that interoperability comes at the cost of tweaking matplotlib details yourself.

 Instead, the recommended way to create plots in seaborn is to use **figure-level functions** like `sns.relplot` as in *relational plot*. Figure-level functions return specialized seaborn objects (such as `FacetGrid`) that are intended to provide more usable results without tweaking.
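
A minimal sketch of the difference, again using a made-up `toy` table rather than the real dataset: a figure-level function creates its own figure and returns a `FacetGrid` that holds one axes per facet.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the pokemon table.
toy = pd.DataFrame({
    "Attack": [49, 52, 48, 110, 134],
    "Defense": [49, 43, 65, 90, 95],
    "Legendary": [False, False, False, True, True],
})

# Figure-level functions return a FacetGrid instead of a single Axes.
grid = sns.relplot(data=toy, x="Attack", y="Defense", col="Legendary")
print(type(grid).__name__)

# The grid stores its axes in a 2-D array: here 1 row and 2 columns.
print(grid.axes.shape)
```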

 %% Cell type:code id:ee05b224-9192-4784-8a6b-c375c4572ddc tags:

 ``` python
 sns.relplot(pokemon, x="Attack", y="Defense", col="Legendary")
 ```

 %% Cell type:markdown id:76fe1021-f8cf-423d-b22b-d8bd4f1cfd94 tags:

 By default, relational plots produce scatter plots but they can also produce line plots by specifying the keyword argument `kind="line"`.

 Alongside `relplot`, seaborn provides several other useful figure-level plotting functions:

 - [`relplot` for **relational plots**](https://seaborn.pydata.org/generated/seaborn.relplot.html), such as scatter plots and line plots.
 - [`displot` for **distribution plots**](https://seaborn.pydata.org/generated/seaborn.displot.html), such as histograms and kernel density estimates.
 - [`catplot` for **categorical plots**](https://seaborn.pydata.org/generated/seaborn.catplot.html), such as strip plots, box plots, violin plots, and bar plots.
 - [`lmplot` for **relational plots with a regression fit**](https://seaborn.pydata.org/generated/seaborn.lmplot.html), such as the scatter plot with regression fit below.

 When reading documentation online, it is important to remember that we will only use figure-level plots in this course because they are the recommended approach. On the [relative merits of figure-level functions](https://seaborn.pydata.org/tutorial/function_overview.html#relative-merits-of-figure-level-functions) in the seaborn documentation:

 > On balance, the figure-level functions add some additional complexity that can make things more confusing for beginners, but their distinct features give them additional power. The tutorial documentation mostly uses the figure-level functions, because they produce slightly cleaner plots, and we generally recommend their use for most applications. The one situation where they are not a good choice is when you need to make a complex, standalone figure that composes multiple different plot kinds. At this point, it’s recommended to set up the figure using matplotlib directly and to fill in the individual components using axes-level functions.

 %% Cell type:code id:72c0d643-b70f-4b3e-b5c8-b3b9693b7cb5 tags:

 ``` python
 sns.lmplot(pokemon, x="Attack", y="Defense", col="Legendary")
 ```

 %% Cell type:markdown id:555008e3-fe59-42f6-be38-4e1133022b1b tags:

 ## Customizing a `FacetGrid` plot

 `relplot`, `displot`, `catplot`, and `lmplot` all return a `FacetGrid`, a specialized seaborn object that represents a data visualization canvas. As we've seen above, a `FacetGrid` can put two plots side-by-side and manage their axes by removing the y-axis labels on the right plot because they are the same as the plot on the left.

 However, there are still many instances where we might want to customize a plot by changing labels or adding titles. We might want to create a bar plot to count the number of each type of pokemon.

 %% Cell type:code id:24c31e9c-f790-4993-a6fb-5c190b008367 tags:

 ``` python
 sns.catplot(pokemon, x="Type 1", kind="count")
 ```

 %% Cell type:markdown id:1767fe40-b82d-490b-9e8b-06732c9fe1a1 tags:

 The pokemon types on the x-axis are hardly readable, the y-axis label "count" could use capitalization, and the plot could use a title. To modify the attributes of a plot, we can assign the returned `FacetGrid` to a variable like `grid` and then call [`tick_params`](https://seaborn.pydata.org/generated/seaborn.FacetGrid.tick_params.html) or [`set`](https://seaborn.pydata.org/generated/seaborn.FacetGrid.set.html#seaborn.FacetGrid.set).

 %% Cell type:code id:50dc1652-2fc0-4517-b3d9-e6f3b75c4b9b tags:

 ``` python
 grid = sns.catplot(pokemon, x="Type 1", kind="count")
 grid.tick_params(axis="x", rotation=60)
 grid.set(title="Count of each primary pokemon type", xlabel="Primary Type", ylabel="Count")
 ```

 %% Cell type:markdown id:dcf574f0-c9b0-4807-9e1e-27b3ca786cb8 tags:

 ## Practice: Life expectancy versus health expenditure

 Seaborn includes a repository of [example datasets](https://github.com/mwaskom/seaborn-data) that we can load into a `DataFrame` by calling `sns.load_dataset`. Let's examine the [Life expectancy vs. health expenditure, 1970 to 2015](https://ourworldindata.org/grapher/life-expectancy-vs-health-expenditure?time=earliest..2015) dataset that combines two data sources:

 1. The Life expectancy at birth dataset from the [UN World Population Prospects](https://population.un.org/wpp/Download/) (2022): "For a given year, it represents the average lifespan for a hypothetical group of people, if they experienced the same age-specific death rates throughout their lives as the age-specific death rates seen in that particular year."
 1. The Health expenditure (2010 int.-$) dataset from [OECD.stat](https://stats.oecd.org/): "Per capita health expenditure and financing in OECD countries, measured in 2010 international dollars."

 %% Cell type:code id:124d7c48-32e2-4758-9b26-abe3955dde8a tags:

 ``` python
 life_expectancy = sns.load_dataset("healthexp", index_col=["Year", "Country"])
 life_expectancy
 ```

 %% Cell type:markdown id:0410e17b-20a0-4d21-b187-a3597c89a642 tags:

-Write a `seaborn` expression to create a line plot comparing the `Year` (x-axis) to the `Life_Expectancy` (y-axis) colored with `hue="Country"`.
+Write a `seaborn` expression to create a line plot plotting the `Life_Expectancy` (y-axis) against the `Year` (x-axis) colored with `hue="Country"`.

 %% Cell type:code id:88f160bc-6b1a-43e7-9bef-16f0d5af39da tags:

 ``` python
 ```

+%% Cell type:markdown id:ae16270f tags:
+
+Anything noticeable? What title can we give to this plot?
+
 %% Cell type:markdown id:7429bf64-f808-44f5-b7a5-fe05cd7efbb1 tags:

 ## What makes bad figures bad?

 In chapter 1 of *Data Visualization*, Kieran Healy explains how data visualization is about communication and rhetoric.

 > While it is tempting to simply start laying down the law about what works and what doesn't, the process of making a really good or really useful graph cannot be boiled down to a list of simple rules to be followed without exception in all circumstances. The graphs you make are meant to be looked at by someone. The effectiveness of any particular graph is not just a matter of how it looks in the abstract, but also a question of who is looking at it, and why. An image intended for an audience of experts reading a professional journal may not be readily interpretable by the general public. A quick visualization of a dataset you are currently exploring might not be of much use to your peers or your students.

 %% Cell type:markdown id:ea1add13-9f55-45fe-b2fb-c56ced8cb5bf tags:

 ### Bad taste

 Kieran identifies three problems, the first of which is **bad taste**.

 <img style="max-width: 100%; max-height: 480px" alt="3-d horizontal bar chart comparing life expectancy across continents with Papyrus font and cute visual style" src="https://socviz.co/assets/ch-01-chartjunk-life-expectancy.png" />

 Kieran draws on Edward Tufte's principles (all quoted from Tufte 1983):

 - have a properly chosen format and design
 - use words, numbers, and drawing together
 - display an accessible complexity of detail
 - avoid content-free decoration, including chartjunk

 In essence, these principles amount to "an encouragement to maximize the 'data-to-ink' ratio." In practice, our plotting libraries like `seaborn` do a fairly good job of providing defaults that generally follow these principles.

 %% Cell type:markdown id:94db5027-751d-4e63-8671-24cc6c047b0f tags:

 ### Bad data

 The second problem is **bad data**, which can involve either cherry-picking data or presenting information in a misleading way.

 > In November of 2016, *The New York Times* reported on some research on people's confidence in the institutions of democracy. It had been published in an academic journal by the political scientist Yascha Mounk. The headline in the *Times* ran, "How Stable Are Democracies? 'Warning Signs Are Flashing Red'" (Taub, 2016). The graph accompanying the article

 <img style="max-width: 100%; max-height: 480px" alt="6-way line plot comparing Percentage of people who say it is 'essential' to live in a democracy (New York Times)" src="https://socviz.co/assets/ch-01-democracy-nyt-version.png" />

 This plot is well-produced, and we could reproduce it by calling `sns.relplot` as we learned above. The x-axis shows the decade of birth for all people surveyed in the research study.

 > [But] scholars who knew the World Values Survey data underlying the graph noticed something else. The graph reads as though people were asked to say whether they thought it was essential to live in a democracy, and the results plotted show the percentage of respondents who said "Yes", presumably in contrast to those who said "No". But in fact the survey question asked respondents to rate the importance of living in a democracy on a ten point scale, with 1 being "Not at all Important" and 10 being "Absolutely Important". The graph showed the difference across ages of people who had given a score of "10" only, not changes in the average score on the question. As it turns out, while there is some variation by year of birth, most people in these countries tend to rate the importance of living in a democracy very highly, even if they do not all score it as "Absolutely Important". The political scientist Erik Voeten redrew the figure based using the average response.

 <img style="max-width: 100%; max-height: 480px" alt="5-way line plot by Erik Voeten showing Average importance of democracy for each Decade of birth" src="https://socviz.co/assets/ch-01-democracy-voeten-version-2.png" />
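
To see how the choice of summary changes the story, the sketch below computes both statistics from the same ratings. The ratings are made up for illustration and are not World Values Survey data.

```python
import pandas as pd

# Made-up ratings on the 1-10 scale for two birth cohorts.
responses = pd.DataFrame({
    "Decade": ["1930s"] * 5 + ["1980s"] * 5,
    "Rating": [10, 10, 10, 9, 8, 10, 9, 9, 8, 8],
})

by_decade = responses.groupby("Decade")["Rating"]
summary = pd.DataFrame({
    # Share of respondents who answered exactly 10.
    "share_rating_10": by_decade.apply(lambda ratings: (ratings == 10).mean()),
    # Average rating across all respondents.
    "average_rating": by_decade.mean(),
})
print(summary)
```

In this toy data, the share answering "10" falls from 60% to 20% between cohorts, while the average rating only moves from 9.4 to 8.8. Both numbers describe the same responses, but a plot of the first looks far more alarming than a plot of the second.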

 %% Cell type:markdown id:12156797-f788-4f1d-9020-3a26f197d0fd tags:

 ### Bad perception

 The third problem is **bad perception**, which refers to how humans process the information contained in a visualization. Let's walk through section 1.3 on "[Perception and data visualization](https://socviz.co/lookatdata.html#perception-and-data-visualization)".