Commit fd1354fd authored by Yuxuan Mei

add new lecture

parent cbbbe242
%% Cell type:markdown id:9a53ab8e-0a96-4209-b13e-449440b75b97 tags:
# CSV Data
In this lesson, we'll review dictionary features and learn about the CSV data file format. By the end of this lesson, students will be able to:
- Identify the list of dictionaries corresponding to some CSV data.
- Loop over a list of dictionaries (CSV rows) and access dictionary values (CSV columns).
%% Cell type:code id:77612e82-7c52-4a84-9de7-e30d90011783 tags:
``` python
import doctest
```
%% Cell type:markdown id:6859110d-2774-44a0-9dba-7751100c1481 tags:
## Review: Dictionary functions
Dictionaries, like lists, are mutable data structures, so they have methods to help store and retrieve elements.
- `d.pop(key)` removes `key` from `d` and returns its value.
- `d.keys()` returns a collection of all the keys in `d`.
- `d.values()` returns a collection of all the values in `d`.
- `d.items()` returns a collection of all `(key, value)` tuples in `d`.
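As a small illustration of these four methods, here is a sketch using a made-up dictionary `hours` (the name and values are only for this example, not from the lesson's files).

```python
# Hypothetical example data, not from the lesson's files.
hours = {"Diana": 10, "Thrisha": 15, "Yuxiang": 20}

removed = hours.pop("Diana")   # removes the key and returns its value
keys = list(hours.keys())      # the remaining keys, in insertion order
values = list(hours.values())  # the remaining values, in the same order
items = list(hours.items())    # (key, value) tuples

print(removed, keys, values, items)
```

Wrapping each result in `list(...)` is only for display; the objects returned by `keys()`, `values()`, and `items()` can be looped over directly.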
There are different ways to loop over a dictionary.
%% Cell type:code id:6ca794c6-2d26-4cce-8096-50070ecb0bb3 tags:
``` python
dictionary = {"a": 1, "b": 2, "c": 3}
for key in dictionary:
    print(key, dictionary[key])
```
%% Cell type:markdown id:c806f16e-f0c1-45a8-9e91-38114b94af42 tags:
## None in Python
In the lesson on File Processing, we saw a function to count the occurrences of each token in a file as a `dict` where the keys are words and the values are counts.
Let's **debug** the following function `most_frequent` that takes a dictionary as *input* and returns the word with the highest count. If the input were a list, we could index the zero-th element from the list and loop over the remaining values by slicing the list. But it's harder to do this with a dictionary.
Python has a special `None` keyword, like `null` in Java, that represents the absence of a value and is often used as a placeholder.
%% Cell type:code id:084590d8-f37f-4852-bc76-f8ebbb0a7b7b tags:
``` python
def most_frequent(counts):
    """
    Returns the token in the given dictionary with the highest count, or None if empty.
    >>> most_frequent({"green": 2, "eggs": 6, "and": 3, "yam": 2})
    'eggs'
    >>> most_frequent({}) # None is not displayed as output
    """
    max_word = None
    for word in counts:
        if counts[word] > counts[max_word]:
            max_word = word
    return max_word
doctest.run_docstring_examples(most_frequent, globals())
```
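One possible fix, sketched under the assumption that the bug is the lookup `counts[max_word]` while `max_word` is still `None` (which raises a `KeyError` for any non-empty dictionary): check for `None` before comparing. The name `most_frequent_fixed` is ours, used only so this sketch does not replace the function above.

```python
def most_frequent_fixed(counts):
    """
    Returns the token in the given dictionary with the highest count, or None if empty.

    >>> most_frequent_fixed({"green": 2, "eggs": 6, "and": 3, "yam": 2})
    'eggs'
    >>> most_frequent_fixed({})  # None is not displayed as output
    """
    max_word = None
    for word in counts:
        # Only look up counts[max_word] once max_word holds a real key.
        if max_word is None or counts[word] > counts[max_word]:
            max_word = word
    return max_word


print(most_frequent_fixed({"green": 2, "eggs": 6, "and": 3, "yam": 2}))
```

Because `or` stops at the first true condition, `counts[max_word]` is never evaluated while `max_word` is `None`.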
%% Cell type:markdown id:0b8c4201-d35f-4517-abe6-a4deb955ec79 tags:
## Loop unpacking
When we need both keys and values, we can unpack each key-value pair by looping over `dictionary.items()`.
%% Cell type:code id:7791f854-9d73-4a11-aad4-cf2b8a07b7f8 tags:
``` python
dictionary = {"a": 1, "b": 2, "c": 3}
for key, value in dictionary.items():
    print(key, value)
```
%% Cell type:markdown id:abf82976-1a50-4125-9920-fff642782996 tags:
Loop unpacking is not only useful for dictionaries, but also for looping over the results of built-in functions such as `enumerate` and `zip`. `enumerate` takes a sequence and returns a sequence of pairs, each holding an element's index and the element's value.
%% Cell type:code id:735741f6-b5df-4548-8b23-26ca0bd67cb0 tags:
``` python
with open("poem.txt") as f:
    for i, line in enumerate(f.readlines()):
        print(i, line[:-1])  # line[:-1] drops the last character, normally the newline
```
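`enumerate` works on any sequence, not only on lines read from a file. A minimal sketch with a made-up list `lines` (standing in for the contents of a file), which also uses the optional `start` argument to count from 1 instead of 0:

```python
lines = ["roses are red", "violets are blue"]  # hypothetical stand-in for file lines

numbered = []
for i, line in enumerate(lines, start=1):
    numbered.append((i, line))  # each pair is unpacked into i and line
    print(i, line)
```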
%% Cell type:markdown id:b55e4520-de85-40af-9dcf-da497e0fa675 tags:
`zip` is another built-in function that takes one or more sequences and returns a *sequence of tuples* consisting of the first element from each given sequence, the second element from each given sequence, etc. If the sequences are not all the same length, `zip` stops after yielding all elements from the shortest sequence.
%% Cell type:code id:3614855c-ed37-428a-a00a-e1212f17e4d7 tags:
``` python
arabic_nums = [ 1, 2, 3, 4, 5]
alpha_nums = ["a", "b", "c", "d", "e"]
roman_nums = ["i", "ii", "iii", "iv", "v"]
for arabic, alpha, roman in zip(arabic_nums, alpha_nums, roman_nums):
    print(arabic, alpha, roman)
```
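The stopping rule for sequences of different lengths can be seen in a short sketch. The lists `letters` and `numbers` are made up for this example:

```python
letters = ["a", "b", "c", "d"]
numbers = [1, 2]  # shorter than letters

# zip stops once the shortest sequence runs out, so "c" and "d" are dropped.
pairs = list(zip(letters, numbers))
print(pairs)  # [('a', 1), ('b', 2)]
```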
%% Cell type:markdown id:d994834c-2bf7-4ab4-ba85-8d32d1cc89be tags:
## Comma-separated values
In data science, we often work with tabular data such as the following table representing the names and hours of some of our TAs.
Name | Hours
-----|-----:
Diana | 10
Thrisha | 15
Yuxiang | 20
Sheamin | 12
A **table** has two main components to it:
- **Rows** corresponding to each entry, such as each individual TA.
- **Columns** corresponding to (required or optional) fields for each entry, such as TA name and TA hours.
A **comma-separated values** (CSV) file is a particular way of representing a table using only plain text. Here is the corresponding CSV file for the above table. Each row is separated with a newline. Each column is separated with a single comma `,`.
```
Name,Hours
Diana,10
Thrisha,15
Yuxiang,20
Sheamin,12
```
We'll learn a couple of ways to process CSV data in this course, the first of which is representing the data as a **list of dictionaries**.
%% Cell type:code id:199d463e-9ecd-46c1-aacc-1bdffae1c00c tags:
``` python
staff = [
    {"Name": "Yuxiang", "Hours": 20},
    {"Name": "Thrisha", "Hours": 15},
    {"Name": "Diana", "Hours": 10},
    {"Name": "Sheamin", "Hours": 12},
]
staff
```
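To make the correspondence between CSV text and a list of dictionaries concrete, here is a rough sketch that builds the list by hand. It assumes no field contains a comma or a quote, and the names `csv_text`, `columns`, and `rows` are ours. `Hours` is converted to `int` explicitly because everything read from text starts out as a string.

```python
csv_text = """Name,Hours
Diana,10
Thrisha,15
Yuxiang,20
Sheamin,12
"""

lines = csv_text.splitlines()
columns = lines[0].split(",")          # the header row gives the dictionary keys

rows = []
for line in lines[1:]:
    values = line.split(",")
    row = dict(zip(columns, values))   # pair each column name with its value
    row["Hours"] = int(row["Hours"])   # convert the text "10" to the number 10
    rows.append(row)

print(rows)
```

Each CSV row becomes one dictionary, and each column name becomes a key in every dictionary.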
%% Cell type:markdown id:8ec2eeac-0253-4bc3-b5b9-ec708d56a425 tags:
To see the total number of TA hours available, we can loop over the list of dictionaries and sum the `"Hours"` values.
%% Cell type:code id:55dc7fda-d50c-4a74-8771-6ae1154bc0e8 tags:
``` python
total_hours = 0
for ta in staff:
    total_hours += ta["Hours"]
total_hours
```
%% Cell type:markdown id:6e76d800-bf01-4a9a-8578-1f187486fbd3 tags:
What are some different ways to get the value of Thrisha's hours?
%% Cell type:code id:930fd021-24ea-4d57-b0bd-762eccf3a88c tags:
``` python
for ta in staff:
    if ta["Name"] == "Thrisha":
        print(ta["Hours"])
```
%% Cell type:markdown id:f8b8eb32 tags:
Poll question: which of the following expressions evaluates to Thrisha's hours?
%% Cell type:code id:5d34d58f tags:
``` python
staff[1]["Hours"]
staff["Hours"][1]
staff["Thrisha"]["Hours"]
staff["Hours"]["Thrisha"]
```
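For the poll, each option can be checked directly. In this sketch (with `staff` repeated so the block runs on its own), the first expression works because a list is indexed by position and each element is a dictionary indexed by key, while indexing the list with a string raises a `TypeError`. The names `thrisha_hours`, `hours_by_name`, and `message` are ours.

```python
staff = [
    {"Name": "Yuxiang", "Hours": 20},
    {"Name": "Thrisha", "Hours": 15},
    {"Name": "Diana", "Hours": 10},
    {"Name": "Sheamin", "Hours": 12},
]

# List index first (row 1), then dictionary key (column "Hours").
thrisha_hours = staff[1]["Hours"]

# A list cannot be indexed with a string, so this raises TypeError.
try:
    staff["Hours"][1]
except TypeError as error:
    message = str(error)

# To look rows up by name, one option is to build a dictionary keyed by name.
hours_by_name = {}
for ta in staff:
    hours_by_name[ta["Name"]] = ta["Hours"]

print(thrisha_hours, hours_by_name["Thrisha"], message)
```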
%% Cell type:markdown id:f84463d4 tags:
## Reading CSV files using Python's built-in `csv` module
Suppose we have a dataset of earthquakes around the world stored in the CSV file `earthquakes.csv`.
%% Cell type:code id:53ae2878 tags:
``` python
import csv
```
%% Cell type:code id:d631d4a9 tags:
``` python
earthquakes = []
with open("materials/earthquakes.csv") as f:
    reader = csv.DictReader(f)
    for row in reader:
        earthquakes.append(row)
earthquakes[:5]
```
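One detail worth keeping in mind: `csv.DictReader` returns every value as a string, so numeric columns need an explicit conversion before they are compared or summed. A small sketch using in-memory text via `io.StringIO` in place of the real file; the two sample rows and the reduced set of columns are assumptions made for illustration:

```python
import csv
import io

# io.StringIO stands in for an open file; the rows are a made-up sample.
sample = io.StringIO(
    "id,name,magnitude\n"
    "nc72666881,California,1.43\n"
    "us20006i0y,Burma,4.9\n"
)

rows = []
for row in csv.DictReader(sample):
    rows.append(row)

first_raw = rows[0]["magnitude"]             # still text
first_number = float(rows[0]["magnitude"])   # converted to a number
print(type(first_raw), type(first_number))
```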
%% Cell type:markdown id:f87d17f0 tags:
`csv.DictWriter` does the reverse: it writes dictionaries to a CSV file. Its main methods are:
- `writeheader()`: Write a row with the field names (as specified in the constructor) to the writer’s file object.
- `writerow(row)` or `writerows(rows)`: Write the row/rows parameter to the writer’s file object.
Here, `row` is a dictionary and `rows` is a list of dictionaries.
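The methods above can be sketched as follows. To keep the example self-contained, it writes to an in-memory `io.StringIO` buffer rather than a real file; the `staff` rows are a shortened copy of the TA table, and the variable names are ours.

```python
import csv
import io

staff = [
    {"Name": "Diana", "Hours": 10},
    {"Name": "Thrisha", "Hours": 15},
]

# io.StringIO stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Name", "Hours"])
writer.writeheader()          # writes the header row from fieldnames
writer.writerow(staff[0])     # writes one dictionary as one row
writer.writerows(staff[1:])   # writes each remaining dictionary as a row

output = buffer.getvalue()
print(output)
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=""` so that line endings are not translated twice.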
%% Cell type:markdown id:12e0a542-abb4-4583-8abe-af435c250162 tags:
## Practice: Largest earthquake place
Write a function `largest_earthquake_place` that takes the path to an earthquake CSV file, reads the data as a list of dictionaries, and returns the name of the location that experienced the largest earthquake. If there are no rows in the dataset (no data at all), return `None`.
id | year | month | day | latitude | longitude | name | magnitude
---|:----:|:-----:|:---:|---------:|----------:|------|---------:
nc72666881 | 2016 | 7 | 27 | 37.672 | -121.619 | California | 1.43
us20006i0y | 2016 | 7 | 27 | 21.515 | 94.572 | Burma | 4.9
nc72666891 | 2016 | 7 | 27 | 37.577 | -118.859 | California | 0.06
nc72666896 | 2016 | 7 | 27 | 37.596 | -118.995 | California | 0.4
nn00553447 | 2016 | 7 | 27 | 39.378 | -119.845 | Nevada | 0.3
For example, considering only the data shown above, the result would be `"Burma"` because it had the earthquake with the largest magnitude (4.9).
%% Cell type:code id:7da4b7e1-f224-4bd6-9cc6-f1c61bbb23ed tags:
``` python
def largest_earthquake_place(path):
    """
    Returns the name of the place with the largest-magnitude earthquake in the specified CSV file.
    >>> largest_earthquake_place("materials/earthquakes.csv")
    'Northern Mariana Islands'
    """
    earthquakes = []
    with open(path) as f:
        reader = csv.DictReader(f)
        for row in reader:
            earthquakes.append(row)
    # TODO: find the place with the largest-magnitude earthquake
    ...
doctest.run_docstring_examples(largest_earthquake_place, globals())
```
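One possible way to finish the TODO, sketched on rows that are already loaded so it runs without the data file. The helper name `largest_place_in_rows` is ours, and the sample rows are the first two from the table above. Magnitudes are converted with `float` because `csv.DictReader` yields strings, and comparing strings would order `"10.0"` before `"4.9"`.

```python
def largest_place_in_rows(rows):
    """
    Returns the "name" of the row with the largest "magnitude", or None if rows is empty.
    """
    max_row = None
    for row in rows:
        # Compare as numbers, not as text.
        if max_row is None or float(row["magnitude"]) > float(max_row["magnitude"]):
            max_row = row
    if max_row is None:
        return None
    return max_row["name"]


sample_rows = [
    {"name": "California", "magnitude": "1.43"},
    {"name": "Burma", "magnitude": "4.9"},
]
print(largest_place_in_rows(sample_rows))  # Burma
```

Inside `largest_earthquake_place`, the same loop could run over the `earthquakes` list built from the file.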
%% Cell type:markdown id:3339b1f0 tags:
Let's see another solution that uses the `pandas` library.
%% Cell type:code id:4fc35602 tags:
``` python
import pandas as pd
```
%% Cell type:code id:0ca0d6dc tags:
``` python
def largest_earthquake_place_pandas(path):
    """
    Returns the name of the place with the largest-magnitude earthquake in the specified CSV file.
    >>> largest_earthquake_place_pandas("materials/earthquakes.csv")
    'Northern Mariana Islands'
    """
    earthquakes = pd.read_csv(path)
    return earthquakes.loc[earthquakes["magnitude"].idxmax()]["name"]
doctest.run_docstring_examples(largest_earthquake_place_pandas, globals())
```
%% Cell type:code id:08ccb960 tags:
``` python
earthquakes = pd.read_csv("materials/earthquakes.csv")
earthquakes.head()
```
%% Cell type:code id:142c7c7d tags:
``` python
type(earthquakes)
```
%% Cell type:code id:62988d32 tags:
``` python
# play with type()...
```