%% Cell type:markdown id:c055d3c2-a5cb-41f5-8f22-fee73f12c96f tags:
# Model Evaluation
In this lesson, we'll learn how to evaluate the quality of a machine learning model. By the end of this lesson, students will be able to:
- Apply `get_dummies` to represent categorical features as one or more dummy variables.
- Apply `train_test_split` to randomly split a dataset into a training set and test set.
- Evaluate machine learning models in terms of overfit and underfit.
%% Cell type:code id:2d4ceb33-7b23-4e11-ac26-0fb51e386bc0 tags:
``` python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import pandas as pd
```
%% Cell type:markdown id:2a116488-57f1-4b0c-bb8c-312663260de6 tags:
## Dummy variables
Last time, we tried to predict the price of a home from all the other variables. However, we learned that since the `city` column contains *categorical* string values, we can't use it as a feature in a regression algorithm.
%% Cell type:code id:343f4bfc-b744-4f03-a918-7aacd503d5df tags:
``` python
homes = pd.read_csv("homes.csv")
homes
```
%% Cell type:markdown id:6e85a5ca-c3fc-4d9c-877d-21e2b54681eb tags:
This problem occurs not only when using `DecisionTreeRegressor`, but also when using `LinearRegression`, since we can't multiply a *categorical* string value by a *numeric* slope value. Let's learn how to represent these categorical features as one or more dummy variables.
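For example, `pd.get_dummies` expands each categorical column into one indicator (0/1) column per category while leaving numeric columns untouched. Here is a quick sketch on just the `city` column (the resulting column names depend on which cities appear in `homes.csv`):

``` python
# A small illustration of get_dummies on the categorical city column.
# Each unique city value becomes its own indicator column.
city_dummies = pd.get_dummies(homes[["city"]])
city_dummies.head()
```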
%% Cell type:code id:dccf776c-c179-4369-a5ea-c77d61aa3234 tags:
``` python
def model_parameters(reg, columns):
    """Returns a string with the linear regression model parameters for the given column names."""
    slopes = [f"{coef:.2f}({columns[i]})" for i, coef in enumerate(reg.coef_)]
    return " + ".join([f"{reg.intercept_:.2f}"] + slopes)
```
%% Cell type:code id:8c204c21-06b2-4383-87bd-da14437994ad tags:
``` python
# Represent the categorical city column as dummy variables so that all features are numeric
X = pd.get_dummies(homes.drop("price", axis=1))
y = homes["price"]
reg = LinearRegression().fit(X, y)
print("Model:", model_parameters(reg, X.columns))
print("Error:", mean_squared_error(y, reg.predict(X)))
```
%% Cell type:markdown id:23fa4a7d-d1bf-4835-bea0-b047868d2307 tags:
## Overfitting
In our introduction to machine learning, we explained that the researchers who calibrated the PurpleAir Sensor (PAS) measurements against the EPA Air Quality Sensor (AQS) measurements ultimately decided to use a model that included only the PAS and humidity features (variables). They passed up the opportunity to use the temperature and dew point features, even though a model that includes all features produced a lower overall mean squared error.
%% Cell type:code id:053df79d-5341-4dee-a1f2-fdfe2c0164e7 tags:
``` python
sensor_data = pd.read_csv("sensor_data.csv")
sensor_data
```
%% Cell type:code id:6f0db273-07a5-4921-a971-191f1752b275 tags:
``` python
X = sensor_data.drop("AQS", axis=1)
y = sensor_data["AQS"]
reg = LinearRegression().fit(X, y)
print("Model:", model_parameters(reg, X.columns))
# squared=False for RMSE: root mean squared error
print("Error:", mean_squared_error(y, reg.predict(X), squared=False))
```
%% Cell type:markdown id:e41ef189-bd43-4cfb-88a2-ef439f9f7416 tags:
In fact, the authors ([Barkjohn et al. 2021](https://doi.org/10.5194/amt-14-4617-2021)) tested the impact of incrementally introducing a feature to the model to determine which combination of features could provide the most useful model, not necessarily the most accurate one. In their research, they aimed to predict the PAS measurement from the AQS measurement, the opposite of our task, and they included interactions between features indicated by the × symbol. For each grouping of models with a certain number of features, they highlighted the model with the lowest **root mean squared error** (RMSE, or the square root of the MSE) using an asterisk `*`.
![Table 2 from Barkjohn et al. 2021 depicting every combination of features used in linear models](https://amt.copernicus.org/articles/14/4617/2021/amt-14-4617-2021-t02.png)
The model at the bottom of the table that includes all the features also has the lowest RMSE loss. But the loss value alone is not a convincing measure: adding more features into a model not only leads to a model that is harder to explain, but also increases the possibility of overfitting.
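As a quick sanity check on that definition, the value reported with `squared=False` matches taking the square root of the MSE ourselves, sketched here with the `reg` fitted on the sensor data above:

``` python
# RMSE is just the square root of the MSE: these two values should match.
mse = mean_squared_error(y, reg.predict(X))
print("MSE: ", mse)
print("RMSE:", mse ** 0.5)
print("RMSE via squared=False:", mean_squared_error(y, reg.predict(X), squared=False))
```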
A model is considered **overfit** if its predictions correspond too closely to the *training dataset* such that improvements reported on the training dataset are not reflected when the model is run on a *testing dataset* (or in the real world). To simulate training and testing datasets, we can take our `X` features and `y` labels and subdivide them into `X_train, X_test, y_train, y_test` using the `train_test_split` function. **The testing dataset should only be used during final model evaluation when we're ready to report the overall effectiveness of our final machine learning model.**
%% Cell type:code id:41003aba-b994-4138-98ff-d245f365680b tags:
``` python
from sklearn.model_selection import train_test_split
# test_size=0.2 indicates 80% training dataset and 20% testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# During model training, use only the training dataset
reg = LinearRegression().fit(X_train, y_train)
print("Model:", model_parameters(reg, X_train.columns))
# During model evaluation, use the testing dataset
print("Error:", mean_squared_error(y_test, reg.predict(X_test), squared=False))
```
%% Cell type:markdown id:44b8bc22-2491-4dbf-b089-e4b8d5d3640c tags:
## Feature selection
**Feature selection** describes the process of selecting only a subset of features in order to improve the quality of a machine learning model. In the air quality sensor calibration study, we can begin with all the features and gradually remove the least-important features one-by-one.
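For example, a rough sketch of that one-at-a-time removal could loop over the four sensor features and report the error of each reduced model:

``` python
# Drop one feature at a time and compare the test RMSE of each reduced model.
all_features = ["PAS", "humidity", "temp", "dew"]
for dropped in all_features:
    subset = [f for f in all_features if f != dropped]
    reg = LinearRegression().fit(X_train.loc[:, subset], y_train)
    error = mean_squared_error(y_test, reg.predict(X_test.loc[:, subset]), squared=False)
    print(f"Without {dropped}: RMSE = {error:.3f}")
```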
%% Cell type:code id:73706461-c23c-4b52-b212-b52d3f2512f5 tags:
``` python
features = ["PAS", "humidity", "temp", "dew"]
reg = LinearRegression().fit(X_train.loc[:, features], y_train)
print("Model:", model_parameters(reg, features))
print("Error:", mean_squared_error(y_test, reg.predict(X_test.loc[:, features]), squared=False))
```
%% Cell type:markdown id:12608799-a3f3-487d-b95b-c2e12c049343 tags:
[Recursive feature elimination](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html) (RFE) automates this process by starting with a model that includes all the variables and removing features from the model one step at a time, beginning with the ones that contribute the least weight (the smallest-magnitude coefficients in a linear regression).
%% Cell type:code id:1333edc7-cb42-4ede-8d5e-0b354e6f67ab tags:
``` python
from sklearn.feature_selection import RFE
# Remove 1 feature per step until half the original features remain
rfe = RFE(LinearRegression(), step=1, n_features_to_select=0.5, verbose=1)
rfe.fit(X_train, y_train)
# Show the final subset of features
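# In rfe.ranking_, selected features are ranked 1 (rfe.support_ gives the same boolean mask)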
rfe_features = X.columns[[r == 1 for r in rfe.ranking_]]
print("Features:", list(rfe_features))
# Extract the last LinearRegression model trained on the final subset of features
reg = rfe.estimator_
print("Model:", model_parameters(reg, rfe_features))
print("Error:", mean_squared_error(y_test, rfe.predict(X_test), squared=False))
```
%% Cell type:markdown id:f6348469-619c-448d-9850-dc05815477c5 tags:
## Cross validation
How do we know when to stop removing features during feature selection? We can certainly use intuition or look at the changes in error as we remove each feature. But this still requires us to evaluate the model somehow. If the testing dataset can only be used at the end of model evaluation, **it was wrong of us to use the testing dataset during feature selection**!
Cross-validation provides one way to help us explore different models before we choose a final model for evaluation at the end. [**Cross-validation**](https://scikit-learn.org/stable/modules/cross_validation.html) lets us evaluate models without touching the testing dataset by introducing new *validation datasets*.
The simplest way to cross-validate is to call the `cross_val_score` helper function on an unfitted machine learning algorithm and the dataset. This function further subdivides the training dataset into 5 folds and, for each of the 5 folds, trains a separate model on the other 4 folds and evaluates it on the held-out validation fold.
![A depiction of a 5 fold cross validation on a training set, while holding out a test set](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)
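To make the folding concrete, here is a rough sketch of what the helper does under the hood, assuming the default 5-fold split without shuffling and the same `DecisionTreeRegressor(max_depth=2)` used below:

``` python
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Roughly what cross_val_score does: for each of 5 folds, train on the other
# 4 folds and measure the error on the held-out validation fold.
kfold = KFold(n_splits=5)
for fold, (train_index, val_index) in enumerate(kfold.split(X_train)):
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X_train.iloc[train_index], y_train.iloc[train_index])
    error = mean_squared_error(
        y_train.iloc[val_index],
        tree.predict(X_train.iloc[val_index]),
        squared=False
    )
    print(f"Fold {fold}: validation RMSE = {error:.2f}")
```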
The `scoring` parameter can accept the string name of [a scorer function](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter). Higher (more positive) scores are considered better, so we use the negative RMSE value as the scorer function.
%% Cell type:code id:9ea0a51b-366a-4c2c-9035-556dc918e3f9 tags:
``` python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
cross_val_score(
    estimator=DecisionTreeRegressor(max_depth=2),
    X=X_train,
    y=y_train,
    scoring="neg_root_mean_squared_error",
    verbose=3,
)
```
%% Cell type:markdown id:529a1ca2-d496-4d85-895a-7cdef97e77f7 tags:
As we've seen throughout these lessons on machine learning, we prefer to automate our processes. Rather than modifying `max_depth` by hand and tweaking the values until we find something that works, we can use `GridSearchCV` to exhaustively search all the hyperparameter options. Here, the first 5 folds for `max_depth=2` are exactly the same as in `cross_val_score`.
%% Cell type:code id:3de69e19-cbbe-4902-85df-1ee0d37c3be5 tags:
``` python
from sklearn.model_selection import GridSearchCV
search = GridSearchCV(
    estimator=DecisionTreeRegressor(),
    param_grid={"max_depth": [2, 3, 4, 5, 6, 7, 8, 9, 10]},
    scoring="neg_root_mean_squared_error",
    verbose=3,
)
search.fit(X_train, y_train)
# Show the best score and best estimator at the end of hyperparameter search
print("Mean score for best model:", search.best_score_)
reg = search.best_estimator_
print("Best model:", reg)
```
%% Cell type:markdown id:de94f45e-2427-488d-b63c-82be388e7f23 tags:
Finally, we can report the test error for the best model by evaluating it against the testing dataset. Why is the testing dataset error different from the mean score for the best model printed above?
%% Cell type:code id:8c93d46d-cb6a-49a5-8876-0c695c376668 tags:
``` python
print("Error:", mean_squared_error(y_test, reg.predict(X_test), squared=False))
```
%% Cell type:markdown id:a8c39ec2-4a43-44be-bf91-c11d38f79c83 tags:
## Visualizing decision tree models
Last time, we plotted the predictions for a linear regression model that was trained to take PAS measurements and predict AQS measurements. What do you think a decision tree model would look like for this simplified PurpleAir sensor calibration problem?
Here's a complete workflow for decision tree model evaluation using the practices above. The resulting plot compares a linear regression model (drawn with seaborn's `lmplot`) against the decisions made by a decision tree.
%% Cell type:code id:3198171d-7a24-4119-82dc-a42f0f52f25d tags:
``` python
# Split dataset into 80% training dataset and 20% testing dataset
X = sensor_data.drop("AQS", axis=1)
y = sensor_data["AQS"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Recursive feature elimination to select the single most important feature based on slope value
rfe = RFE(LinearRegression(), n_features_to_select=1, verbose=1)
rfe.fit(X_train, y_train)
# Print the best feature to predict AQS
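# The selected feature has rank 1, so argmin of ranking_ gives its column index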
rfe_feature = X.columns[rfe.ranking_.argmin()]
print("Best feature to predict AQS:", rfe_feature)
# Use only the best feature
X = X[[rfe_feature]]
X_train = X_train[[rfe_feature]]
X_test = X_test[[rfe_feature]]
# Grid search cross-validation to tune the max_depth hyperparameter using RMSE loss metric
search = GridSearchCV(
    estimator=DecisionTreeRegressor(),
    param_grid={"max_depth": [2, 3, 4, 5, 6, 7, 8, 9, 10]},
    scoring="neg_root_mean_squared_error",
    verbose=3,
)
search.fit(X_train, y_train)
# Print the best score and best estimator at the end of hyperparameter search
print("Mean score for best model:", search.best_score_)
reg = search.best_estimator_
print("Best model:", reg)
# During model evaluation, use the testing dataset
print("Test error:", mean_squared_error(y_test, search.predict(X_test), squared=False))
# Visualize the tree decisions compared to a LinearRegression model (lmplot)
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.tree import plot_tree
sns.set_theme()
grid = sns.lmplot(sensor_data, x=rfe_feature, y="AQS")
# Create a demonstration dataset that counts from 0 to the max PAS value
X_demo = pd.DataFrame(np.arange(X[rfe_feature].max()), columns=[rfe_feature])
grid.facet_axis(0, 0).plot(X_demo, reg.predict(X_demo), c="orange")
grid.set(title=f"lmplot vs {reg}")
# Show nodes in the decision tree
plt.figure(dpi=300)
plot_tree(
    reg,
    max_depth=2, # Only show the first two levels
    linewidth=3,
    feature_names=[rfe_feature],
    label="root",
    filled=True,
    impurity=False,
    proportion=True,
    rounded=False
) and None # Hide return value of plot_tree
```
%% Cell type:markdown id:4ab80bc3-acd4-472a-a537-ea2d2ff5039b tags:
The testing dataset error rates for both the `DecisionTreeRegressor` and the `LinearRegression` models are not too far apart. Which model would you prefer to use? Justify your answer using either the error rates below or the visualizations above.
%% Cell type:code id:5884ec8b-1ae1-485f-9736-f035e1ec7ca3 tags:
``` python
print("Tree error:", mean_squared_error(y_test, search.predict(X_test), squared=False))
print("Line error:", mean_squared_error(
y_test,
LinearRegression().fit(X_train, y_train).predict(X_test),
squared=False
))
```
%% Cell type:markdown id:623aa026-c92a-48a3-8cac-ca90e53222b6 tags:
Earlier, we discussed how overfitting to the testing dataset can be mitigated by using cross-validation. But in answering the previous question on whether we should prefer a `DecisionTreeRegressor` model or a `LinearRegression` model, we've actually overfit the model by hand! Explain how preferring one model over the other according to the visualization overfits to the testing dataset.