"In this lesson, we'll review the dictionary features and learn about the CSV data file format. By the end of this lesson, students will be able to:\n",
"\n",
"- Identify the list of dictionaries corresponding to some CSV data.\n",
"- Loop over a list of dictionaries (CSV rows) and access dictionary values (CSV columns)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77612e82-7c52-4a84-9de7-e30d90011783",
"metadata": {},
"outputs": [],
"source": [
"import doctest"
]
},
{
"cell_type": "markdown",
"id": "6859110d-2774-44a0-9dba-7751100c1481",
"metadata": {},
"source": [
"## Review: Dictionary functions\n",
"\n",
"Dictionaries, like lists, are also mutable data structures so they have functions to help store and retrieve elements.\n",
"\n",
"- `d.pop(key)` removes `key` from `d`.\n",
"- `d.keys()` returns a collection of all the keys in `d`.\n",
"- `d.values()` returns a collection of all the values in `d`.\n",
"- `d.items()` returns a collection of all `(key, value)` tuples in `d`.\n",
"\n",
"There are different ways to loop over a dictionary."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ca794c6-2d26-4cce-8096-50070ecb0bb3",
"metadata": {},
"outputs": [],
"source": [
"dictionary = {\"a\": 1, \"b\": 2, \"c\": 3}\n",
"for key in dictionary:\n",
" print(key, dictionary[key])"
]
},
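{
"cell_type": "markdown",
"id": "4d8f2a6e-1b3c-4e5f-8a9b-0c1d2e3f4a5b",
"metadata": {},
"source": [
"Here is a quick sketch demonstrating the dictionary functions listed above on the same small dictionary."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a1b2c3d-4e5f-4a6b-8c9d-0e1f2a3b4c5d",
"metadata": {},
"outputs": [],
"source": [
"dictionary = {\"a\": 1, \"b\": 2, \"c\": 3}\n",
"print(dictionary.keys())    # collection of all keys\n",
"print(dictionary.values())  # collection of all values\n",
"print(dictionary.items())   # collection of all (key, value) tuples\n",
"print(dictionary.pop(\"a\"))  # removes \"a\" and returns its value, 1\n",
"print(dictionary)           # \"a\" is no longer in the dictionary"
]
},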
{
"cell_type": "markdown",
"id": "c806f16e-f0c1-45a8-9e91-38114b94af42",
"metadata": {},
"source": [
"## None in Python\n",
"\n",
"In the lesson on File Processing, we saw a function to count the occurrences of each token in a file as a `dict` where the keys are words and the values are counts.\n",
"\n",
"Let's **debug** the following function `most_frequent` that takes a dictionary as *input* and returns the word with the highest count. If the input were a list, we could index the zero-th element from the list and loop over the remaining values by slicing the list. But it's harder to do this with a dictionary.\n",
"\n",
"Python has a special `None` keyword, like `null` in Java, that represents a placeholder value."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "084590d8-f37f-4852-bc76-f8ebbb0a7b7b",
"metadata": {},
"outputs": [],
"source": [
"def most_frequent(counts):\n",
" \"\"\"\n",
" Returns the token in the given dictionary with the highest count, or None if empty.\n",
"When we need keys and values, we can loop over and unpack each key-value pair by looping over the `dictionary.items()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7791f854-9d73-4a11-aad4-cf2b8a07b7f8",
"metadata": {},
"outputs": [],
"source": [
"dictionary = {\"a\": 1, \"b\": 2, \"c\": 3}\n",
"for key, value in dictionary.items():\n",
" print(key, value)"
]
},
{
"cell_type": "markdown",
"id": "abf82976-1a50-4125-9920-fff642782996",
"metadata": {},
"source": [
"Loop unpacking is not only useful for dictionaries, but also for looping over other sequences such as `enumerate` and `zip`. `enumerate` is a built-in function that takes a sequence and returns another sequence of pairs representing the element index and the element value."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "735741f6-b5df-4548-8b23-26ca0bd67cb0",
"metadata": {},
"outputs": [],
"source": [
"with open(\"poem.txt\") as f:\n",
" for i, line in enumerate(f.readlines()):\n",
" print(i, line[:-1])"
]
},
{
"cell_type": "markdown",
"id": "b55e4520-de85-40af-9dcf-da497e0fa675",
"metadata": {},
"source": [
"`zip` is another built-in function that takes one or more sequences and returns a *sequence of tuples* consisting of the first element from each given sequence, the second element from each given sequence, etc. If the sequences are not all the same length, `zip` stops after yielding all elements from the shortest sequence."
"for arabic, alpha, roman in zip(arabic_nums, alpha_nums, roman_nums):\n",
" print(arabic, alpha, roman)"
]
},
{
"cell_type": "markdown",
"id": "d994834c-2bf7-4ab4-ba85-8d32d1cc89be",
"metadata": {},
"source": [
"## Comma-separated values\n",
"\n",
"In data science, we often work with tabular data such as the following table representing the names and hours of some of our TAs.\n",
"\n",
"Name | Hours\n",
"-----|-----:\n",
"Diana | 10\n",
"Thrisha | 15\n",
"Yuxiang | 20\n",
"Sheamin | 12\n",
"\n",
"A **table** has two main components to it:\n",
"\n",
"- **Rows** corresponding to each entry, such as each individual TA.\n",
"- **Columns** corresponding to (required or optional) fields for each entry, such as TA name and TA hours.\n",
"\n",
"A **comma-separated values** (CSV) file is a particular way of representing a table using only plain text. Here is the corresponding CSV file for the above table. Each row is separated with a newline. Each column is separated with a single comma `,`.\n",
"\n",
"```\n",
"Name,Hours\n",
"Diana,10\n",
"Thrisha,15\n",
"Yuxiang,20\n",
"Sheamin,12\n",
"```\n",
"\n",
"We'll learn a couple ways of processing CSV data in this course, first of which is representing the data as a **list of dictionaries**."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "199d463e-9ecd-46c1-aacc-1bdffae1c00c",
"metadata": {},
"outputs": [],
"source": [
"staff = [\n",
" {\"Name\": \"Yuxiang\", \"Hours\": 20},\n",
" {\"Name\": \"Thrisha\", \"Hours\": 15},\n",
" {\"Name\": \"Diana\", \"Hours\": 10},\n",
" {\"Name\": \"Sheamin\", \"Hours\": 12},\n",
"]\n",
"staff"
]
},
{
"cell_type": "markdown",
"id": "8ec2eeac-0253-4bc3-b5b9-ec708d56a425",
"metadata": {},
"source": [
"To see the total number of TA hours available, we can loop over the list of dictionaries and sum the \"Hours\" value."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55dc7fda-d50c-4a74-8771-6ae1154bc0e8",
"metadata": {},
"outputs": [],
"source": [
"total_hours = 0\n",
"for ta in staff:\n",
" total_hours += ta[\"Hours\"]\n",
"total_hours"
]
},
{
"cell_type": "markdown",
"id": "6e76d800-bf01-4a9a-8578-1f187486fbd3",
"metadata": {},
"source": [
"What are some different ways to get the value of Thrisha's hours?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "930fd021-24ea-4d57-b0bd-762eccf3a88c",
"metadata": {},
"outputs": [],
"source": [
"for ta in staff:\n",
" if ta[\"Name\"] == \"Thrisha\":\n",
" print(ta[\"Hours\"])"
]
},
{
"cell_type": "markdown",
"id": "f8b8eb32",
"metadata": {},
"source": [
"Poll Question: select the right option"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d34d58f",
"metadata": {},
"outputs": [],
"source": [
"staff[1][\"Hours\"]\n",
"staff[\"Hours\"][1]\n",
"staff[\"Thrisha\"][\"Hours\"]\n",
"staff[\"Hours\"][\"Thrisha\"]"
]
},
{
"cell_type": "markdown",
"id": "f84463d4",
"metadata": {},
"source": [
"## Reading CSV files using Python's built-in csv package\n",
"Suppose we have a dataset of earthquakes around the world stored in the CSV file `earthquakes.csv`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53ae2878",
"metadata": {},
"outputs": [],
"source": [
"import csv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d631d4a9",
"metadata": {},
"outputs": [],
"source": [
"earthquakes = []\n",
"with open(\"materials/earthquakes.csv\") as f:\n",
" reader = csv.DictReader(f)\n",
" for row in reader:\n",
" earthquakes.append(row)\n",
"earthquakes[:5]"
]
},
{
"cell_type": "markdown",
"id": "f87d17f0",
"metadata": {},
"source": [
"`csv.DictWriter` also exists; you can do the following to write a row into a csv file:\n",
"- `writeheader()`: Write a row with the field names (as specified in the constructor) to the writer’s file object.\n",
"- `writerow(row)` or `writerows(rows)`: Write the row/rows parameter to the writer’s file object.\n",
"\n",
"Here, `row` is a dictionary and `rows` is a list of dictionaries."
]
},
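{
"cell_type": "markdown",
"id": "5f6e7d8c-9b0a-4c1d-8e2f-3a4b5c6d7e8f",
"metadata": {},
"source": [
"For example, here is a minimal sketch that writes the `staff` list of dictionaries from earlier to a hypothetical output file `staff.csv` (the filename is just for illustration)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a1b2c3d-4e5f-4678-9abc-def012345678",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: write the staff list of dictionaries to a hypothetical staff.csv\n",
"with open(\"staff.csv\", \"w\", newline=\"\") as f:\n",
"    writer = csv.DictWriter(f, fieldnames=[\"Name\", \"Hours\"])\n",
"    writer.writeheader()     # header row: Name,Hours\n",
"    writer.writerows(staff)  # one row per dictionary in the list"
]
},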
{
"cell_type": "markdown",
"id": "12e0a542-abb4-4583-8abe-af435c250162",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"## Practice: Largest earthquake place\n",
"\n",
"Write a function `largest_earthquake_place` that takes the earthquake `data` represented as a list of dictionaries and returns the name of the location that experienced the largest earthquake. If there are no rows in the dataset (no data at all), return `None`.\n",
"\n",
"id | year | month | day | latitude | longitude | name | magnitude\n",
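{
"cell_type": "markdown",
"id": "8c9d0e1f-2a3b-4c5d-8e6f-7a8b9c0d1e2f",
"metadata": {},
"source": [
"One possible approach is sketched below, assuming the `magnitude` values read by `csv.DictReader` are strings that should be compared numerically."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1f2e3d4c-5b6a-4978-8c7d-6e5f4a3b2c1d",
"metadata": {},
"outputs": [],
"source": [
"def largest_earthquake_place(data):\n",
"    \"\"\"\n",
"    Returns the name of the place that experienced the largest earthquake,\n",
"    or None if there are no rows in the dataset.\n",
"    \"\"\"\n",
"    largest = None  # row with the largest magnitude seen so far\n",
"    for row in data:\n",
"        if largest is None or float(row[\"magnitude\"]) > float(largest[\"magnitude\"]):\n",
"            largest = row\n",
"    if largest is None:\n",
"        return None\n",
"    return largest[\"name\"]\n",
"\n",
"largest_earthquake_place(earthquakes)"
]
},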