To practice advanced SQL. To get familiar with commercial database management systems (SQL Server) and using a database management system in the cloud (SQL Azure). To practice physical tuning through the addition of appropriate indexes.
**Assignment tools:**
SQL Server on Windows Azure through SQL Azure. SQL Server Management Studio
has been installed on the CSE lab and [VDI machines](http://vdi.cs.washington.edu/vdi/).
**Assigned date:** Tuesday, Oct. 10, 2017
**Due date:** Friday, Oct. 20, 2017. You have 1 2/3 weeks for this assignment.
**What to turn in:**
`hw3-q1.sql`, `hw3-q2.sql`, etc. (see below).
**Resources:**
Prof. Cheung's research group has been working on a tool called [Scythe](https://courses.cs.washington.edu/courses/cse344/tools/scythe/)
that allows you to generate
SQL given input/output examples. Feel free to use it to learn SQL syntax. This is entirely optional and
is not "officially supported." Also, it is meant to be a
learning tool and not to do your homework for you! If you just turn in the SQL generated by Scythe, you will
very likely get 0 points for the assignment (not to mention that the generated SQL might not even be correct...).
## Assignment Details
This homework is a continuation of homework 2 but with three changes:
- The queries are more challenging
- You will get to use a commercial database system (i.e., no more SQLite :).
SQLite simply cannot execute these queries in any reasonable amount of
time; hence, we will use SQL Server, which has one of the most advanced
query optimizers. SQL Server also has a very nice client application,
[SQL Server Management Studio](https://docs.microsoft.com/en-us/sql/ssms/sql-server-management-studio-ssms),
that you will get to use in this assignment.
- You will use the Microsoft Azure cloud.
Here again is the schema of the `Flights` database, for your reference:
```SQL
FLIGHTS (fid int, year int, month_id int, day_of_month int,
```
We leave it up to you to decide how to declare these tables and translate their types in SQL Server. But make sure that your relations include all the attributes listed above. Note that SQL Server will complain if you try to ingest data and it needs to truncate it because VARCHAR fields are too small.
In addition, impose the following constraints:
- The primary key of the `FLIGHTS` table is `fid`.
- The primary keys for the other tables are `cid`, `mid`, and `did` respectively.
We provide the flights database as a set of plain-text data files in the linked
`.tar.gz` archive. Each file in this archive contains all the rows for the named table,
one row per line.
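Because each file holds one row per line, line counts taken right after unpacking give the number of rows each table should end up with. The following is a minimal sketch, assuming a Unix-like shell with `tar`, `find`, and `wc`. The function name `extract_and_count` and the archive name in the usage line are hypothetical; use the archive linked from the assignment.

```bash
# Sketch: unpack a .tar.gz archive into a directory and print
# "<file name> <line count>" for every extracted file.
# Assumes the files have no header line and end with a newline,
# so the line count equals the row count.
extract_and_count() {
  local archive="$1" dest="$2"
  mkdir -p "$dest" || return 1
  tar -xzf "$archive" -C "$dest" || return 1
  find "$dest" -type f | sort | while IFS= read -r f; do
    printf '%s %s\n' "$(basename "$f")" "$(wc -l < "$f" | tr -d '[:space:]')"
  done
}
```

For example, `extract_and_count flights-small.tar.gz data` (hypothetical archive name) would unpack into `data/` and list one count per table file. Keep these numbers around for the sanity check after loading.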
In this homework, you will do three things:
1. Create a database in the SQL Server database management system running as a service on
Windows Azure.
2. Write and test the SQL queries below; keep in mind that the queries are quite challenging,
both for you and for the database engine.
3. Reflect on using a database management system running in a public cloud.
### A. Setting up an Azure SQL Database [0 points]
In this assignment, we want you to learn how to use an Azure SQL database from scratch.
Your first step will thus be to set up a database in the Azure service and import your data.
This step may seem tedious but it is crucially important. We want you to be able to continue using Azure after the class ends. For this, you need to know how to use the system starting from nothing.
**NOTE: These steps will take some time to complete, so start early!**
#### Step 1: Create an Azure account and log in to Azure portal
You should have received an email from invites@microsoft.com inviting you to join an organization.
Follow that link to apply the subscription to your account.
Afterwards, you will be forwarded to the [Azure portal](https://portal.azure.com/).
#### Step 2: Learn about Azure SQL Server
Spend some time clicking around, reading documentation, watching tutorials, and generally familiarizing yourself with Azure and SQL Server.
#### Step 3: Create a database
From the [Azure portal](https://portal.azure.com/), select "+ New" on the left,
then select "Databases", then select "SQL Database". This will bring up a panel with
configuration options for a new DB instance.
Perform the following configuration:
- Choose a database name (e.g., "cse344-17au").
- Choose a name for the resource group that will be created (e.g., "myresourcegroup").
- Select a source: Blank database
- Create a new server by clicking on "Server" (it will say "Configure required settings"). A second panel will appear to the right; click on "Create a new server" and a third panel will appear to the right of this. Fill in the form as follows:
  - Choose a name for the server (e.g., "fooBarSqlserver"). Unlike your database name,
    the server name must be unique across the universe of Azure SQL databases.
  - Choose an admin login and password. (You will need these when you access your database using tools
    other than the portal.)
  - Set the location to "West US" or "West US 2".
  - Make sure "Allow azure services to access server" is checked.
  - Click "Select".
- Change to a cheaper pricing tier. (The default is currently "Standard S2", which is
  too expensive.) To do this, click on "Pricing tier", which will open a pane with "Standard"
  selected. Find the slider for "DTUs", and turn it all the way down to 10. It should now say the cost is only $15/month.
  - Click "Apply".
- Select "Pin to dashboard".
Your configuration should now look like this (again, replace 414 with 344):
Now click on the "Create" button to create this database.
You will see a panel on the dashboard that says "Deploying SQL Database" while the database
is being set up: this will take a while. Once that is done, it should open up a panel with details on your database. (If not, click on the panel for your SQL database to open it.)
Finally, click on the button near the top that says "Set server firewall".
#### Step 4: Query the database from the Azure portal
Select "Query editor", and accept the warning that this is a preview feature (i.e., not necessarily what the final launched version will look like).
Click on the "Login" button and enter the username and password that you chose when you created your database in Step 3. Once you have done that, you can try entering SQL commands. Press the "Run" button to execute them.
#### Step 5: Use the database from SQL Server Management Studio
As convenient as the Azure portal Query editor is, you may want to use SQL Server Management Studio (SSMS) in order to examine SQL Server's query execution plans.
If you have Windows, SSMS may already be installed or you should be able to download and install it from [Microsoft's web site](https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms).
(For detailed instructions on installing SSMS, see [this document](https://courses.cs.washington.edu/courses/cse414/17sp/sections/Install_SSMS.pdf).)
You can also use SSMS on the PC lab machines, where it is already installed, or on the [VDI lab machines](http://vdi.cs.washington.edu/vdi/).
(If you have trouble connecting from any of those machines, it may be because there was a mistake setting up the firewall rules above.)
Another option is that you can launch a Windows VM directly in Azure. The process is similar to how we created a database instance above.
Once your VM instance is running, open the panel for the instance and click on "Connect" to download a `.rdp` file. On a Mac, you can then connect to your Windows machine by using
Microsoft Remote Desktop. Once connected, you can install SSMS using the link above.
When you launch SSMS, it will ask you to connect to a SQL Server instance. Tell it the name of the server you created in Step 3 above.
Then, select "SQL authentication" (rather than Windows authentication) and give it your username and password from Step 3. At that point, you should be able to see the tables of your database in the panel on the left side.
Buttons to create a new query and execute it are in the middle of the menu bar.
Just to the right of the "Execute" button is an option to display the query execution plan.
Once you enable it, the query execution plan will appear in a panel
on the bottom of SSMS, as shown [here](https://courses.cs.washington.edu/courses/cse414/17sp/hw/hw3/ssms-execute.png). We will discuss query plans in class, and you may find it useful to examine them.
Now you are ready to move on to the next part of the assignment!
### B. Ingesting Data [0 points]
Next, you will import all the data from HW2. Make sure that you execute your
`CREATE TABLE` statements first so that the tables you will add tuples to already exist.
Also, make sure that the types of the columns in the tables you created match the data.
To import data, you will need to use a utility called `bcp`, which should come with the
command prompt on Windows. (Note that you can also use a Windows machine in the lab or a
Windows VM instance on Azure.) If you are using a Unix system, you can use the `freebcp` utility,
which is part of [FreeTDS](http://www.freetds.org/userguide/).
On a Mac, you can install `freebcp` using the Homebrew package manager:
```bash
brew install freetds
```
Make sure that Homebrew installs FreeTDS version 0.95.79 or later.
(You may need to run `brew update` and `brew upgrade` to get the most recent versions.)
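The `bcp` command takes the target table, the keyword `in`, and the data file to load, followed by connection and format options.
Here is an example that loads `flights-small.csv` into `dbo.Flights` with server `foosqlserver.database.windows.net`, user `bar`, and database name `cse344-17au`: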
```bash
bcp dbo.Flights in flights-small.csv -U bar@foosqlserver -S foosqlserver.database.windows.net -P "some-complicated-password" -c -t "," -r "0x0a" -d cse344-17au
```
Usage of `freebcp` is similar to `bcp`, except that you need to use `-D` instead of `-d` when
specifying the database and `\n` instead of `0x0a` as the row terminator.
The same example from above becomes:
```bash
freebcp dbo.Flights in flights-small.csv -U bar@foosqlserver -S foosqlserver.database.windows.net -P "some-complicated-password" -c -t "," -r "\n" -D cse344-17au
```
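The two invocations differ only in the database flag and the row terminator, so both can be generated from one template. The following is a minimal sketch, not part of the assignment tooling: `load_command` and the variables `SERVER`, `DB_USER`, `DB_NAME`, and `DB_PASSWORD` are hypothetical names, and the defaults are the placeholder values from the examples above. It only prints the command, so you can review it before running anything.

```bash
# Sketch: print (do not run) the load command for one table and one data file.
# Usage: load_command <bcp|freebcp> <table> <file>
# The password is never printed; the output refers to $DB_PASSWORD instead.
load_command() {
  local tool="$1" table="$2" file="$3"
  local server="${SERVER:-foosqlserver}"   # placeholder server name
  local user="${DB_USER:-bar}"             # placeholder admin login
  local db="${DB_NAME:-cse344-17au}"       # placeholder database name
  local db_flag row_term
  case "$tool" in
    bcp)     db_flag="-d"; row_term="0x0a" ;;
    freebcp) db_flag="-D"; row_term='\n' ;;
    *) echo "unknown tool: $tool" >&2; return 1 ;;
  esac
  printf '%s dbo.%s in %s -U %s@%s -S %s.database.windows.net -P "$DB_PASSWORD" -c -t "," -r "%s" %s %s\n' \
    "$tool" "$table" "$file" "$user" "$server" "$server" "$row_term" "$db_flag" "$db"
}
```

Calling `load_command freebcp Flights flights-small.csv` prints the `freebcp` line with `-D` and `\n`. Once the printed line looks right, run it in a shell where `DB_PASSWORD` holds your admin password.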
**Note:** When loading the `flights-small.csv` table, `bcp` may appear to hang toward the
end; however, it probably did not hang. It is creating the index on the primary key,
which feels like an eternity but takes approximately 20 minutes, and then it
will complete.
As a sanity check, count the number of rows in `Flights` and make sure it matches the
number of rows in the SQLite Flights table. If the count on SQL Server is too low, the data didn't all get imported. You will need to drop the table and try again.
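If the SQLite database is not at hand, the line count of the data file is an alternative reference, since each file holds one row per line. A minimal sketch follows, assuming the file has no header line and ends with a newline. `rows_match` is a hypothetical helper, and the second argument is whatever `SELECT COUNT(*) FROM Flights;` returned on SQL Server.

```bash
# Sketch: compare the number of lines in a data file with a row count
# reported by the database. Returns non-zero on a mismatch.
rows_match() {
  local file="$1" db_count="$2"
  local file_count
  # wc -l counts newline characters; strip the padding some systems add.
  file_count="$(wc -l < "$file" | tr -d '[:space:]')"
  if [ "$file_count" -eq "$db_count" ]; then
    echo "OK: $file_count rows"
  else
    echo "MISMATCH: file has $file_count lines, table has $db_count rows"
    return 1
  fi
}
```

A mismatch where the table has fewer rows than the file has lines is the case described above: drop the table and load it again.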
If it still does not work, you can try splitting the `flights-small.csv` file into
multiple smaller files and importing them one at a time. On Mac, to split the file into 10 smaller files, run: