Commit 42f7c104 authored by Dan Suciu's avatar Dan Suciu

working files

parent c13901f0
# Homework 1
## Due Monday, Oct 18th at 11:59pm
Please wait until we have the first section to start this homework. The section contains some important setup steps.
See [git instructions]( to obtain your copy of the homework repository.
## Objectives:
Learn how to use a shared-nothing, relational database management system (DBMS) offered
as a cloud service. We will use <a href="">Amazon Redshift</a> on the
<a href="">Amazon Web Services (AWS)</a> cloud. In the assignment, we will set up Redshift clusters, ingest data, and execute simple queries.
## Assignment tools:
- <a href="">Amazon Redshift</a>
- <a href="">Amazon Web Services</a>
*Amazon Redshift is a very expensive cloud service.*
For that reason, the assignment will not use very large datasets or clusters. Additionally, we will use the free trial of the service.
## To activate the free trial:
- Go to: <a href=""></a>
- Click on "Get started with a free 2-month trial" ("AWS Free Tier")
- Read the instructions
- The instructions there are **not** enough: please see Section 1 for the steps you need to take to deploy your cluster.
*IMPORTANT* The Amazon Redshift free trial is limited to the *DC2.Large* node type. Please use ONLY that instance type in this assignment. Since we are using Redshift Spectrum to query data directly from S3 in this assignment, use the *us-west-2* (Oregon) region. You can change the region in the upper right corner of the Redshift console.
*IMPORTANT* When you are not using the cluster, remember to `pause` it; pausing and later resuming takes about 10 minutes, but if you forget, you will run out of credits quickly. See the last item below.
## What to turn in:
You will turn in:
- SQL for the queries
- Runtime for each query
- A brief discussion of the query runtimes that you observed in different settings
Use the templates under `starter-code` and place your answers in the `submission` directory.
## How to submit the assignment:
In your GitLab repository you should see a directory called `hw/hw1/submission`. Put your report in that directory. Remember to commit and push your report to GitLab (`git add && git commit && git push`)!
You should add your report early and keep updating it and pushing it as you do more work. We will collect the final version after the deadline passes. If you need extra time on an assignment, let us know. This is a graduate course, so we are reasonably flexible with deadlines but please do not overuse this flexibility. Use extra time only when you truly need it.
# Assignment Details
In this assignment you will deploy a Redshift cluster, ingest data, and run queries on that data. You should have deployed the cluster during the first section, where the deployment details are explained. Note that you may need to wait some time before the cluster becomes available (check the "Status" field of the cluster).
## Create Table and Ingest Data in Amazon Redshift (0 points)
During the first section we ingested 1GB subsets of the TPCH data into Redshift. In this assignment you also need the 10GB subsets; you can use the commands in `starter-code/createTablesAndCopy.sql` for this purpose, but **you must update your `aws_iam_role`** before running the COPY commands in that file.
You can read more about IAM roles and ARN <a href="">here</a>.
## Run Queries
Run each query listed below multiple times.
Report the average, min and max run time in `submission/time.txt` (use the template in `starter-code/time.txt`).
Use warm-cache timing, which means you discard the first run of each query. Go to the Query tab of the AWS web console for your Redshift cluster to view the runtime of the queries.
1. (25 points) Write and run queries on the *1GB* dataset and a 2-node cluster. Report the timing in `submission/time.txt`.
- What is the total number of parts offered by each supplier? The query should return the name of the supplier and the total number of parts. Write your query in `submission/1a.sql`. We will copy this file directly and execute it on Redshift, so make sure it runs as-is.
- What is the cost of the most expensive part by any supplier? The query should return only the price of that most expensive part. No need to return the name. Write your query in `submission/1b.sql`.
- What is the cost of the most expensive part for each supplier? The query should return the name of the supplier and the cost of the most expensive part but you do not need to return the name of that part. Write your query in `submission/1c.sql`.
- What is the total number of customers per nation? The query should return the name of the nation and the number of unique customers. Write your query in `submission/1d.sql`.
- What is the number of parts shipped between October 10, 1996 and November 10, 1996 for each supplier? The query should return the name of the supplier and the number of parts. Write your query in `submission/1e.sql`.
2. (25 points) Run queries from (1) on the *10GB* dataset and record the timing in `submission/time.txt`.
You can remove the 1GB data by executing `DELETE FROM table_name` for each table. Note that you will run more queries on the 1GB dataset below. You may want to do those questions first. After you delete the 1GB dataset, load the 10GB dataset.
The `lineitem` table is the largest table in the 10GB dataset. Load this table from parallel file segments instead of the single file. The data for this table is divided into 10 segments, named `lineitem.tbl.1`, `lineitem.tbl.2`, ..., `lineitem.tbl.10`, in the bucket `s3://uw-csed516/tpch/10GB-split/`.
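Redshift's `COPY` loads, in parallel, every S3 object whose key begins with the given prefix, so pointing `COPY` at the common prefix of the ten segments loads them all with one command. A sketch (substitute your own IAM role ARN):

```sql
-- Matches lineitem.tbl.1 ... lineitem.tbl.10 by key prefix and
-- loads the segments in parallel across the cluster's slices.
copy lineitem from 's3://uw-csed516/tpch/10GB-split/lineitem.tbl.'
REGION 'us-west-2'
CREDENTIALS 'aws_iam_role=<your redshift role arn>'
delimiter '|';
```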
3. (20 points) Run queries from (1) on the 10 GB dataset but this time increase the cluster size from 2 to 4 nodes. Record the timing in `submission/time.txt`.
To do this, you can either create a new, larger cluster or you can use the cluster resize feature available from the AWS console.
4. (10 points) A customer is considered a _Gold_ customer if they have orders totalling more than $1,000,000.00. Customers with orders totalling between $500,000.00 and $1,000,000.00 are considered _Silver_.
Write a SQL query to compute the number of customers in these two categories. Try different methods of writing the query (plain SQL only, or using a UDF or a view to categorize a customer). Write your favorite version in `submission/4.sql`.
Discuss your experience with the various methods to carry out such analysis, in `submission/discussion.txt`.
Use the 1GB data set and the 2-node cluster.
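As a starting point, one of the methods you might compare is a plain-SQL version built around `CASE`; the sketch below shows the general shape only (you should still develop and compare the UDF and view variants yourself):

```sql
-- One possible shape (plain SQL): compute each customer's order total,
-- label it with CASE, then count customers per label.
SELECT category, COUNT(*) AS num_customers
FROM (SELECT o_custkey,
             CASE WHEN SUM(o_totalprice) > 1000000 THEN 'Gold'
                  WHEN SUM(o_totalprice) >= 500000 THEN 'Silver'
             END AS category
      FROM orders
      GROUP BY o_custkey) totals
WHERE category IS NOT NULL
GROUP BY category;
```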
5. (20 points) Query data on s3 vs local data: re-run the queries from (1) on the *10GB* data set on s3.
Data is located in `s3://uw-csed516/tpch/athena/` with a folder for each of the following tables: `customer`, `supplier`, `orders`, `region`, `nation`, `part`, `partsupp`, and `lineitem`.
The data is in text-file format and the delimiter is the pipe character (`|`). For example, the S3 location of the data for the `lineitem` relation is `s3://uw-csed516/tpch/athena/lineitem/`.
Record the timing in `time.txt` and your thoughts in `discussion.txt`.
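To query the S3 data in place you can define external tables with Redshift Spectrum. A sketch for the `region` table (the external schema and catalog database names here are illustrative; substitute your own IAM role ARN, and repeat the `CREATE EXTERNAL TABLE` for the other relations with their column lists):

```sql
-- One-time setup: an external schema backed by a Glue/Athena data catalog.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG DATABASE 'csed516'
IAM_ROLE '<your redshift role arn>'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Pipe-delimited text files, one S3 folder per table.
CREATE EXTERNAL TABLE spectrum.region(
  R_RegionKey int,
  R_Name varchar(64),
  R_Comment varchar(160),
  skip varchar(64))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://uw-csed516/tpch/athena/region/';
```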
Here are sample data-load times for the 10GB dataset to a 2-node cluster using the `dc2.large` node type:
- customer: 33.86s
- orders: 2m 9s
- lineitem: 7m 28s (single-file load), 3m 41s (multi-file load)
- nation: 2.31s
- part: 24.73s
- partsupp: 1m 23s
- region: 2.18s
- supplier: 8.27s
## When you've completed the assignment, don't forget to pause your Redshift clusters!
Amazon's free trial will allow you to continuously run a cluster for about two weeks, so make sure you pause or terminate your instances. However, note that you may choose to use the Redshift cluster in your project: for that reason, we recommend that you pause it for now and remember to terminate it later, after the project.
CREATE TABLE customer(
C_CustKey int ,
C_Name varchar(64) ,
C_Address varchar(64) ,
C_NationKey int ,
C_Phone varchar(64) ,
C_AcctBal decimal(13, 2) ,
C_MktSegment varchar(64) ,
C_Comment varchar(120) ,
skip varchar(64)
);
CREATE TABLE lineitem(
L_OrderKey int ,
L_PartKey int ,
L_SuppKey int ,
L_LineNumber int ,
L_Quantity int ,
L_ExtendedPrice decimal(13, 2) ,
L_Discount decimal(13, 2) ,
L_Tax decimal(13, 2) ,
L_ReturnFlag varchar(64) ,
L_LineStatus varchar(64) ,
L_ShipDate datetime ,
L_CommitDate datetime ,
L_ReceiptDate datetime ,
L_ShipInstruct varchar(64) ,
L_ShipMode varchar(64) ,
L_Comment varchar(64) ,
skip varchar(64)
);
CREATE TABLE nation(
N_NationKey int ,
N_Name varchar(64) ,
N_RegionKey int ,
N_Comment varchar(160) ,
skip varchar(64)
);
CREATE TABLE orders(
O_OrderKey int ,
O_CustKey int ,
O_OrderStatus varchar(64) ,
O_TotalPrice decimal(13, 2) ,
O_OrderDate datetime ,
O_OrderPriority varchar(15) ,
O_Clerk varchar(64) ,
O_ShipPriority int ,
O_Comment varchar(80) ,
skip varchar(64)
);
CREATE TABLE part(
P_PartKey int ,
P_Name varchar(64) ,
P_Mfgr varchar(64) ,
P_Brand varchar(64) ,
P_Type varchar(64) ,
P_Size int ,
P_Container varchar(64) ,
P_RetailPrice decimal(13, 2) ,
P_Comment varchar(64) ,
skip varchar(64)
);
CREATE TABLE partsupp(
PS_PartKey int ,
PS_SuppKey int ,
PS_AvailQty int ,
PS_SupplyCost decimal(13, 2) ,
PS_Comment varchar(200) ,
skip varchar(64)
);
CREATE TABLE region(
R_RegionKey int ,
R_Name varchar(64) ,
R_Comment varchar(160) ,
skip varchar(64)
);
CREATE TABLE supplier(
S_SuppKey int ,
S_Name varchar(64) ,
S_Address varchar(64) ,
S_NationKey int ,
S_Phone varchar(18) ,
S_AcctBal decimal(13, 2) ,
S_Comment varchar(105) ,
skip varchar(64)
);
copy customer from 's3://uw-csed516/tpch/10GB/customer.tbl' REGION 'us-west-2' CREDENTIALS 'aws_iam_role=<your redshift role arn>' delimiter '|';
copy orders from 's3://uw-csed516/tpch/10GB/orders.tbl' REGION 'us-west-2' CREDENTIALS 'aws_iam_role=<your redshift role arn>' delimiter '|';
copy lineitem from 's3://uw-csed516/tpch/10GB/lineitem.tbl' REGION 'us-west-2' CREDENTIALS 'aws_iam_role=<your redshift role arn>' delimiter '|';
copy nation from 's3://uw-csed516/tpch/10GB/nation.tbl' REGION 'us-west-2' CREDENTIALS 'aws_iam_role=<your redshift role arn>' delimiter '|';
copy part from 's3://uw-csed516/tpch/10GB/part.tbl' REGION 'us-west-2' CREDENTIALS 'aws_iam_role=<your redshift role arn>' delimiter '|';
copy partsupp from 's3://uw-csed516/tpch/10GB/partsupp.tbl' REGION 'us-west-2' CREDENTIALS 'aws_iam_role=<your redshift role arn>' delimiter '|';
copy region from 's3://uw-csed516/tpch/10GB/region.tbl' REGION 'us-west-2' CREDENTIALS 'aws_iam_role=<your redshift role arn>' delimiter '|';
copy supplier from 's3://uw-csed516/tpch/10GB/supplier.tbl' REGION 'us-west-2' CREDENTIALS 'aws_iam_role=<your redshift role arn>' delimiter '|';
4. The query runs faster when the sun comes out.
5. The query runs faster after a good night's sleep.
Place your homework submission in this directory,
then commit and push.
# Homework 2
## Due Monday, November 1st at 11:59pm
## Objectives:
Become familiar with Apache Spark (and Databricks). Learn how to use Spark to perform
data management tasks and simple data analysis tasks.
## Assignment tools:
- [Databricks](
## What to turn in:
- Your code, in a single .ipynb file. Run your code to produce the results before you save the ipynb file, and make sure all outputs and plots are displayed.
- The outputs should be in this order, one output/plot per cell:
1. Gaussians for q1 (also copy and save the output in q1.txt)
2. The first plot for q2 (save in q2xyz.jpg)
3. The second plot for q2 (save in q2yzw.jpg)
- Record how long it takes to train, for each configuration. Specifically:
1. Save the training time (in seconds) for q1 in q1time.txt.
2. Save the time for q3 in q3time.txt.
3. Save the time for q4 in q4time.txt.
- For q5 & q6, write your thoughts in q5.txt and q6discussion.txt.
## How to submit the assignment:
As usual, put your files under hw2/submission, run `git add ...`, `git commit ...` and `git push`.
# Assignment Details
In this assignment, we will work with astronomy data. We first give an overview of the application that motivates the analysis we will perform in the assignment.
Large-scale astronomical imaging surveys (e.g., <a href="">Sloan Digital Sky Survey</a>) collect databases of telescope images. The key analysis is to extract sources (galaxies, stars, quasars, etc.) from these images. While extracting sources is a simple process, classifying these
sources into object types is difficult. An example of this type of
classification is the use of multiple wavelength observations in
separating high redshift quasars (i.e., galaxies powered by a central
black hole) from stars within our galaxy. Given that both quasars and
stars are point sources (i.e., they cannot be distinguished by data
from a single image alone) and that there are 400 times more stars
than quasars within a data set, the accuracy of this classification
determines the success of finding some of the highest redshift sources
within the universe.
The Gaussian Mixture Model (GMM) algorithm can serve as a foundation to perform this
classification. It is a clustering algorithm that will help us group
sources into different types. We describe this algorithm below.
In this assignment, we will use as input a table that contains sources
extracted from telescope images and their features. We will apply the GMM algorithm to build a model of those sources.
For more details about the application, please refer to <a href="">[RM15]</a>.
## Data
The data takes the form of a table with five attributes:
| ID | X | Y | Z | W |
|----|---|---|---|---|
| 1237661088029606015 | 0.575469 | 1.37509 | 1.941 | -0.0360003 |
| 1237661088024822536 | 1.00735 | 3.06909 | 3.701 | -0.059 |
| 1237661088024822606 | 1.4684 | 2.5072 | 3.184 | -0.105 |
| 1237661088024887302 | 0.761256 | 1.44754 | 1.356 | -0.0959997 |
The first attribute, `ID`, is a unique identifier for each source. Each of the other four attributes, `X`, `Y`, `Z`, and `W`
is a measurement for that source. Thus each column is a type of measurement and each row is a source.
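For example, a row of this table (comma-separated, as in the CSV files used later) splits into the source ID and its 4D feature vector. Note that parsing the ID directly as an `int` preserves all 19 digits, whereas going through `float` would round it, since such values exceed the 53-bit mantissa of a double:

```python
import numpy as np

# One row of the table, as it appears in the CSV files.
line = "1237661088029606015,0.575469,1.37509,1.941,-0.0360003"

# Parse the ID straight to int (float would lose precision for 19 digits).
id_str, *feature_strs = line.split(',')
source_id = int(id_str)
features = np.array(feature_strs, dtype=float)   # the (X, Y, Z, W) vector
```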
## Algorithm
In our application, we have points (i.e., sources) in a 4D space. We
want to cluster them to identify different types of sources. We could
use different clustering algorithms. GMM is one
type of algorithm that has been shown to work well for these types of
applications, so we will use it.
A Gaussian mixture model is a model of a distribution as a
mixture of K separate multivariate normal distributions, each with three
parameters: mean, covariance, and weight. We will talk about
covariance rather than variance or standard deviation because our data
is in a 4D space. Thus, the mean is a 4D vector and the
covariance is a 4x4 matrix. The mean and covariance describe
the position and shape of the Gaussian. The weight captures the relative
amplitude of the Gaussian in the model. The sum of the weights of
the Gaussians in the model is 1.0. GMM assumes all the data
points are generated from this mixture of Gaussian distributions.
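In symbols, the model describes the density of a point $x \in \mathbb{R}^4$ as

```math
p(x) \;=\; \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \sum_{k=1}^{K} w_k = 1,
```

where $\mu_k$ is the 4D mean vector, $\Sigma_k$ the 4x4 covariance matrix, and $w_k$ the weight of component $k$.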
There is no closed-form analytic solution to estimate the parameters of a
mixture of Gaussians from a sample data set. The standard approach is
to iteratively maximize the likelihood function of the mixture model
in an algorithm called expectation maximization (EM).
Expectation maximization [AD77] has two steps: the E-step and M-step,
which are repeated until parameters converge to maximize the
likelihood of the observations. EM starts by assuming a set of K random
components. It computes for each observation the probability of
belonging to each component of the model. This is called the E-step.
The second step in the iterative process, called the M-step, uses the
points associated with each component to estimate better parameters for
each of the K components. The E-step and the M-step are repeated until the parameters converge.
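The two steps can be made concrete with a toy, pure-NumPy sketch: 1D data and K=2 components with scalar variances, purely for illustration (the assignment itself uses MLlib on 4D data):

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x (1-D toy case)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Toy 1-D data drawn near two centers.
x = np.array([-2.1, -1.9, -2.0, 1.9, 2.1, 2.0])

# Initial guesses for K=2 components: weights, means, variances.
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

# E-step: responsibility of each component for each point.
dens = np.stack([w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)])
resp = dens / dens.sum(axis=0)          # shape (2, n); each column sums to 1

# M-step: re-estimate the parameters from the responsibilities.
nk = resp.sum(axis=1)
w_new = nk / len(x)
mu_new = (resp @ x) / nk
var_new = np.array([(resp[k] * (x - mu_new[k]) ** 2).sum() / nk[k]
                    for k in range(2)])
```

After a single iteration the means have already moved toward the two data clumps; repeating the two steps drives them to convergence.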
The following figure shows the results of applying GMM to a similar
astronomical dataset <a href="">[RM15]</a>. The figure shows two 2D projections of the
4D model.
<img src="./figs/points_2.png" title="GMM clusters" />
## Environment
Spark MLlib implements GMM and, in this assignment, we will use it to
cluster 1.9M astronomical sources:
1. Deploy Databricks: To deploy Databricks, see the instructions from the section.
* Make sure you edit your cluster to have autoscaling disabled, use `i3.xlarge` for both the worker type and driver type, and adjust #Workers to get 2/4/8 nodes in the clusters.
2. Data load and transform: Data is provided in the form of
CSV files in S3.
- Read the data from S3.
- Parse it and otherwise prepare it as needed for the analysis.
- Bucket: `uw-csed516`
- Object: `smalldatasetspark/wise-colors-15-20-subsetsmall8.csv`
- For example, you can call the following to get an RDD (the input-format and key/value classes are the standard ones for plain text files):
```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.hadoopFile('s3n://uw-csed516/smalldatasetspark/',
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.LongWritable',
                    'org.apache.hadoop.io.Text')
```
3. Run GMM: From [MLlib on Spark](, the [GMM]( implementation is available in Python. Run GMM on the sources that you prepared above. Remember that the points are in four dimensions.
## Questions
### Find and describe components (20 Points)
- Run this experiment on a 2-node (i.e., 2-worker) cluster on Databricks. See the starter code folder for the format.
- Use the MLlib implementation of GMM with `k=7` components and the default `maxIterations` (which is `100`) to build a model of the data. Your code should save the timing in `q1time.txt` (to test your code, you can pass a smaller value for the `maxIterations` argument so that training finishes sooner; however, the time reported in `q1time.txt` should be the time with the default `maxIterations`).
- List the weight, mean(mu), and covariance matrix of each component, by printing the `weights` and `gaussians` properties of the model (if your GMM is called `model`, this means that you print `model.weights` and `model.gaussians`). Your code should save the output to a file named `q1.txt`. Include this file in your submission.
### Plot the source clusters (10 Points)
- Each point is represented in four dimensions (X, Y, Z, W).
- Plot the X, Y and Z dimensions on a scatter plot, with a different color for each cluster. Your code should save the plot to a file named `q2xyz.jpg`.
- Plot the Y, Z and W dimensions on a scatter plot, with a different color for each cluster. Your code should save the plot to a file named `q2yzw.jpg`.
### Explore speed-up (20 Points)
- In this part of the assignment, we will evaluate Spark's speed-up on the GMM algorithm.
- For this, we keep the data size and task fixed and vary the number of workers available for processing.
- Vary the number of workers from 2 to 4 to 8 and plot the runtime for finding 7 components as in question 1. You already ran part 1 with 2 workers. For this part, you only need to run GMM on 4- and 8-worker clusters. Report the runtimes (with the default `maxIterations`) in `q3time.txt`.
### Explore scale-up (20 points)
- In this part, we increase the dataset size and see how the change in data size impacts the analysis time. For this, a larger dataset with 15M sources is located at:
- Bucket: `uw-csed516`
- Object: `largerdatasetspark/wise-colors-15-20-subset1.csv`
- For example, you can replace `smalldatasetspark` with `largerdatasetspark` in the `hadoopFile` call to get this data.
- Run GMM on this larger dataset and on the 8-worker cluster.
- Report the execution time (with the default `maxIterations`) in `q4time.txt`.
### Data management and analysis over subsets (10 points)
- Generate different-size subsets of the larger dataset (you can use selections to extract subsets of the
data). Run the GMM algorithm on those subsets.
- Comment on the query execution time, on the components that you find, and any other interesting findings. Write your thoughts in `q5.txt` (10 points)
### Analysis over dimensions (20 points)
- On the large dataset, repeat the experiment by running GMM using three out of four of the available dimensions.
- Comment on the query execution time and on the components that you find. Write your thoughts in `q6discussion.txt`.
## References
[AD77] A. P. Dempster et al., <a href="">Maximum likelihood from incomplete data via the EM algorithm</a>, JRSS, 1977
[RM15] Ryan Maas et al., <a href="">Gaussian Mixture Models Use-Case: In-Memory Analysis with Myria</a>, VLDB, 2015
import os
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
from numpy import array
from pyspark.mllib.clustering import GaussianMixture, GaussianMixtureModel
rdd = sc.hadoopFile('s3n://uw-csed516/smalldatasetspark/',
                    'org.apache.hadoop.mapred.TextInputFormat',
                    'org.apache.hadoop.io.LongWritable',
                    'org.apache.hadoop.io.Text')
import numpy as np
data = rdd.values()
data = x: np.fromstring(x, dtype=float, sep=','))
data = x: (x[0], x[1:]))
data = t: (int(t[0]), t[1]))
rdd2 = sc.hadoopFile('s3n://uw-csed516/largerdatasetspark/',
                     'org.apache.hadoop.mapred.TextInputFormat',
                     'org.apache.hadoop.io.LongWritable',
                     'org.apache.hadoop.io.Text')
import numpy as np
data2 = rdd2.values()
data2 = x: np.fromstring(x, dtype=float, sep=','))
data2 = x: (x[0], x[1:]))
data2 = x: (int(x[0]), x[1]))
4 nodes: 999.99s
8 nodes: 999.99s
Place your homework submission in this directory,
then commit and push.