Commit 08bae2a3 authored by Alvin Cheung

hw5 released

parent 03b27fc1
# CSE 414 Homework 5: JSON, NoSQL, and AsterixDB
**Objectives:** To practice writing queries over the semi-structured data model.
To be able to manipulate semistructured data in JSON and practice using a NoSQL database system (AsterixDB).
**Assigned date:** Tuesday, April 24, 2018.
**Due date:** Tuesday, May 1, 2018. You have 1 week for this homework.
**Questions:** Post them on the [discussion board](https://piazza.com/washington/spring2018/cse414). Tag your post with "Asterix" on Piazza if it pertains to the syntax of SQL++ / AsterixDB installation problems. Otherwise tag your questions with "HW5."
**What to turn in:**
A single file for each question, i.e., `hw5-q1.sqlp`, `hw5-q2.sqlp`, etc., in the `submission` directory. Each file should contain commands executable by SQL++, with text answers written as comments (delimited by `--` as in SQL).
**Resources**
- [starter code](https://gitlab.cs.washington.edu/cse414-18sp/cse414-18sp/tree/master/hw5/starter-code): which contains `mondial.adm` (the entire dataset), `country`, `mountain`, and `sea` (three subsets)
- [documentation for AsterixDB](https://asterixdb.apache.org/docs/0.9.2/index.html)
- [mailing list for AsterixDB / SQL++ questions](https://asterixdb.apache.org/community.html). Sign up on the "users" mailing list and you can post questions subsequently (you can unsubscribe afterwards if you like).
- [AsterixDB draft tutorial](https://courses.cs.washington.edu/courses/cse414/18sp/uwnetid/asterix.pdf). You need to log in with your UW net ID to access this document. As mentioned in class, this is still work in progress. The Asterix folks have asked us not to share this with others outside of this class.
## Assignment Details
In this homework, you will be writing SQL++ queries over the semi-structured data model implemented in [AsterixDB](http://asterixdb.apache.org). Asterix is a new Apache project on building a DBMS over data stored in JSON files.
### A. Setting up AsterixDB (0 points)
1. Download and install AsterixDB in your home VM:
- Download [this zip file](https://drive.google.com/file/d/1lid-Auc6DDRA5d3UwX7yaHs11vff4ZzD/view) and unzip it somewhere in your home directory.
*Note: If you are installing it on your own machine, note that AsterixDB requires Java 8; Java 9 will not work.*
2. Download the assignment files and uncompress them. All of them are JSON data files; you can inspect them using your favorite text editor.
3. Start the server. Go to the [Asterix documentation website](https://asterixdb.apache.org/docs/0.9.2/index.html) and follow the instructions listed under "Option 1: Using NC Services." Follow the instructions under “Quick Start”.
If you use the home VM, `start-sample-cluster.sh` is located in `<directory that you unzipped the file above>/opt/local/bin`. You can start Asterix by first going to the `bin` directory and then running `JAVA_HOME=/ ./start-sample-cluster.sh` (or `start-sample-cluster.bat` on Windows; you might see a few extra windows pop up when it starts, which you can ignore).
Running the script may appear to do nothing. If it worked, you should be able to open the web interface in your browser by visiting `http://localhost:19001`, as described on the website.
In the web interface, select:
- Query language SQL++
- Output format JSON (lossless)
4. Copy, paste, and edit the `<path to mondial.adm>` text in the Query box, then press Run:
```sql
DROP DATAVERSE hw5 IF EXISTS;
CREATE DATAVERSE hw5;
USE hw5;
CREATE TYPE worldType AS {auto_id:uuid };
CREATE DATASET world(worldType) PRIMARY KEY auto_id AUTOGENERATED;
LOAD DATASET world USING localfs
(("path"="127.0.0.1:///<path to mondial.adm>, e.g., /home/auser/414/hw5/mondial.adm"),("format"="adm"));
/* Edit the absolute path above to point to your copy of mondial.adm. */
/* Use '/' instead of '\' in a path for Windows. e.g., C:/414/hw5/mondial.adm. */
```
```sql
/* Note: if you type one command at a time, then end it with a ";" */
USE hw5;
```
5. Alternatively, you can also use the terminal to run queries rather than the web interface. After you have started Asterix, put your query in a file (say `hw5-q1.sqlp`), then execute the query by typing the following command in terminal:
```bash
curl -v --data-urlencode "statement=`cat hw5-q1.sqlp`" --data pretty=true http://localhost:19002/query/service
```
This will print the output on the screen. If there is too much output, you can save it to a file:
```bash
curl -v --data-urlencode "statement=`cat hw5-q1.sqlp`" --data pretty=true http://localhost:19002/query/service > output.txt
```
You can now view `output.txt` using your favorite text editor.
6. Run, examine, modify these queries. They contain useful templates for the questions on the homework: make sure you understand them.
```sql
-- return the set of countries
USE hw5;
SELECT x.mondial.country FROM world x;
```
```sql
-- return each country, one by one (see the difference?)
USE hw5;
SELECT y as country FROM world x, x.mondial.country y;
```
```sql
-- return just their codes, and their names, alphabetically
-- notice that -car_code is not a legal field name, so we enclose in ` … `
USE hw5;
SELECT y.`-car_code` as code, y.name as name
FROM world x, x.mondial.country y order by y.name;
```
```sql
-- this query will NOT run...
USE hw5;
SELECT z.name as province_name, u.name as city_name
FROM world x, x.mondial.country y, y.province z, z.city u
WHERE y.name='Hungary';
-- ...because some provinces have a single city, others have a list of cities; fix it:
USE hw5;
SELECT z.name as province_name, u.name as city_name
FROM world x, x.mondial.country y, y.province z,
CASE WHEN is_array(z.city) THEN z.city
ELSE [z.city] END u
WHERE y.name='Hungary';
```
```sql
-- same, but return the city names as a nested collection;
-- note correct treatment of missing cities
-- also note the convenient LET construct (see SQL++ documentation)
USE hw5;
SELECT z.name as province_name, (select u.name from cities u)
FROM world x, x.mondial.country y, y.province z
LET cities = CASE WHEN z.city is missing THEN []
WHEN is_array(z.city) THEN z.city
ELSE [z.city] END
WHERE y.name='Hungary';
```
7. To shut down Asterix, simply run `stop-sample-cluster.sh` in the terminal. If you are using the home VM, this script is located in
`<directory that you unzipped the file above>/opt/local/bin` (or `opt\local\bin\stop-sample-cluster.bat` on Windows). If you are using the VM, go to the `bin` directory and then run `JAVA_HOME=/ ./stop-sample-cluster.sh` to shut down Asterix.
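
The `curl` command from step 5 can also be issued from a script. Below is a minimal Python sketch, assuming the query service endpoint shown in step 5 (`http://localhost:19002/query/service`) and the same two form fields, `statement` and `pretty`. The helper names `build_query_request` and `run_query` are hypothetical and are not part of AsterixDB or the starter code.

```python
import urllib.parse
import urllib.request

# Endpoint taken from the curl command in step 5.
QUERY_SERVICE_URL = "http://localhost:19002/query/service"


def build_query_request(statement: str) -> urllib.request.Request:
    """Build a form-encoded POST carrying the same fields as the curl command."""
    body = urllib.parse.urlencode(
        {"statement": statement, "pretty": "true"}
    ).encode("utf-8")
    # Supplying `data` makes urllib send a POST request.
    return urllib.request.Request(QUERY_SERVICE_URL, data=body)


def run_query(statement: str) -> str:
    """Send the statement to a running AsterixDB instance and return the raw response text."""
    with urllib.request.urlopen(build_query_request(statement)) as response:
        return response.read().decode("utf-8")
```

With the sample cluster running, something like `run_query(open("hw5-q1.sqlp").read())` should return the same text that `curl` prints.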
### B. Problems (100 points)
**For all questions that ask for a free-response answer, please leave your responses in comments.**
Use only the `mondial` dataset for problems 1-9. For problems 10-12 we will ask you to load in extra datasets provided in starter code.
1. Retrieve all the names of all cities located in Peru, sorted alphabetically. Name your output attribute ``cities``. [Result Size: 30 rows]
2. For each country return its name, its population, and the number of religions, sorted alphabetically by country. Name your output attributes ``country``, ``population``, ``num_religions``. [Result Size: 238 rows]
3. For each religion return the number of countries where it occurs; order them in decreasing number of countries. Name your output attributes ``religion``, ``num_countries``. [Result size: 37]
4. For each ethnic group, return the number of countries where it occurs, as well as the total population world-wide of that group. Hint: you need to multiply the ethnicity’s percentage with the country’s population. Use the functions `float(x)` and/or `int(x)` to convert a `string` to a `float` or to an `int`. Name your output attributes ``ethnic_group``, ``num_countries``, ``total_population``. You can leave your final `total_population` as a `float` if you like. [Result Size: 262]
5. Compute the list of all mountains, their heights, and the countries where they are located. Here you will join the "mountain" collection with the "country" collection, on the country code. You should return a list consisting of the mountain name, its height, the country code, and country name, in descending order of the height. Name your output attributes ``mountain``, ``height``, ``country_code``, ``country_name``. [Result Size: 272 rows]
Hint: Some mountains can be located in more than one country. You need to output them for each country they are located in.
6. Compute a list of countries with all their mountains. This is similar to the previous problem, but now you will group the mountains for each country; return both the mountain name and its height. Your query should return a list where each element consists of the country code, country name, and a list of mountain names and heights; order the countries by the number of mountains they contain, in descending order. Name your output attributes ``country_code``, ``country_name``, ``mountain``, ``mountain_height``. [Result Size: 238]
7. Find all countries bordering two or more seas. Here you need to join the "sea" collection with the "country" collection. For each country in your list, return its code, its name, and the list of bordering seas, in decreasing order of the number of seas. Name your output attributes ``country_code``, ``country_name``, ``sea``. [Result Size: 74]
8. Return all landlocked countries. A country is landlocked if it borders no sea. For each country in your list, return its code, its name, in decreasing order of the country's area. Note: this should be an easy query to derive from the previous one. Name your output attributes ``country_code``, ``country_name``, ``area``. [Result Size: 45]
9. For this query you should also measure and report the runtime; it may be approximate (warning: it might run for a while). Find all distinct pairs of countries that share both a mountain and a sea. Your query should return a list of pairs of country names. Avoid including a country with itself, like in (France,France), and avoid listing both (France,Korea) and (Korea,France) (not a real answer). Name your output attributes ``first_country``, ``second_country``. [Result Size: 7]
10. Create a new dataverse called hw5index, then run the following commands:
```sql
USE hw5index;
CREATE TYPE countryType AS OPEN {
`-car_code`: string,
`-area`: string,
population: string
};
CREATE DATASET country(countryType)
PRIMARY KEY `-car_code`;
CREATE INDEX countryID ON country(`-car_code`) TYPE BTREE;
LOAD DATASET country USING localfs
(("path"="127.0.0.1://<path to country.adm>, e.g., /414/hw5/country.adm"),("format"="adm"));
```
This created the type `countryType`, the dataset `country`, and a `BTREE` index on the attribute `-car_code`, which is also the primary key. The type is `OPEN`, which means that its records may have other fields besides the three required fields `-car_code`, `-area`, and `population`.
Create two new types: `mountainType` and `seaType`, and two new datasets, `mountain` and `sea`. Both should have two required fields: `-id` and `-country`. Their key should be autogenerated, and of type `uuid` (see how we did it for the mondial dataset). Create an index of type `KEYWORD` (instead of `BTREE`) on the `-country` field (for both `mountain` and `sea`). Turn in the complete sequence of commands for creating all three types, datasets, and indices (for `country`, `mountain`, `sea`).
Recall from lecture that AsterixDB only allows creating an index on a top-level collection; hence we provide the `country`, `sea`, etc. collections individually even though their data is already included in mondial.
11. Re-run the query from 9. (“pairs of countries that share both a mountain and a sea”) on the new dataverse `hw5index`. Report the new runtime. [Result Size: 7]
12. Modify the query from 11. to return, for each pair of countries, the list of common mountains, and the list of common seas. Name your output attributes ``first_country``, ``second_country``, ``mountain``, ``sea``. [Result Size: 7]
## Final Warning for HW6!
You will need to have received your AWS credits and your account set up for HW6. If any issues have arisen, we expect you to have been following up with Amazon (and following up with the follow up if you don't hear from them within a few days. After all it's your HW not theirs...)
We won't be asking you to turn in any further "evidence" that you have gotten this done in this HW, but consider this as your final warning. **If you still don't have this resolved when HW6 is released, then you will either need to use up your late days / pay Amazon out of your pocket / receive very few points for HW6.** There will be very little that the staff can do to bail you out given that we have asked you to do this since the second week of the quarter.
## Submission Instructions
Write your answers in a file for each question: `hw5-q1.sqlp`, ... `hw5-q12.sqlp`. Leave your runtime and other responses in comments.
**Important**: To remind you, in order for your answers to be added to the git repo,
you need to explicitly add each file:
```sh
$ git add hw5-q1.sqlp ...
```
**Again, just because your code has been committed on your local machine does not mean that it has been
submitted -- it needs to be on GitLab!**
Use the same bash script `turnIn_Hw5.sh` in the root level directory of your repository that
commits your changes, deletes any prior tag for the current lab, tags the current commit,
and pushes the branch and tag to GitLab.
If you are using the home VM or Mac OSX, you should be able to run the following:
```sh
$ ./turnIn_Hw5.sh
```
Like previous assignments, make sure you check the results afterwards to make sure that your file(s)
have been committed.
# Additional Help for SQL++ (HW5)
The [AsterixDB documentation on SQL++](https://asterixdb.apache.org/docs/0.9.3/sqlpp/manual.html) is a very valuable resource. This guide is only intended to give more tangible examples and explanations of how SQL++ works and to point out features of SQL++ that are likely to be useful for the homework. Completing this HW is not at all dependent on the material covered here. You are allowed to use the entirety of SQL++ to complete the assignment.
## Understanding Your Data (`mondial.adm`)
It is easy to start this assignment without fully understanding what the starter code is doing. If you do not understand the starter code, you may struggle when errors seem to appear out of nowhere. The best place to start is to make sure you know what data you are processing.
In your starter code folder, you should have the `mondial.adm` file. This is a very large text file (80000+ lines), so here is a top-level view of what you are dealing with:
```json
{
"mondial": {
"country": [ ... ],
"continent": [ ... ],
"organization": [ ... ],
"sea": [ ... ],
"river": [ ... ],
"lake": [ ... ],
"island": [ ... ],
"mountain": [ ... ],
"desert": [ ... ]
}
}
```
In addition, your homework spec provides the following code block to correctly setup the data for the majority of the questions:
```sqlp
DROP DATAVERSE hw5 IF EXISTS;
CREATE DATAVERSE hw5;
USE hw5;
CREATE TYPE worldType AS {auto_id:uuid };
CREATE DATASET world(worldType) PRIMARY KEY auto_id AUTOGENERATED;
LOAD DATASET world USING localfs (("path"="127.0.0.1://<path to mondial.adm>"),("format"="adm"));
```
When you run the above code, what you are doing is the following:
* Creating a dataverse called "`hw5`" (SQL++ speak for a database instance)
* Creating a generic type called "`worldType`". This type is `OPEN` by default, meaning that when we eventually load in data, the "mondial" key in `mondial.adm` will be allowed.
* Creating a dataset called "`world`" which is a collection that will store our to-be single instance of a `worldType`.
* Loading in the local file
Before moving on to the actual SQL++ queries, it is important to think about the data hierarchically. The top-level view below shows the path we follow to access the data.
```txt
---------
| world |
---------
|
------------
| |
----------- --------
| mondial | | uuid |
----------- --------
|
-----------------------------------------
| | | | | |
----------- ------------ -------
| country | | mountain | | sea | ...
----------- ------------ -------
| | |
```
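
The hierarchy above can be explored on a small stand-in record before touching the full file. This is a minimal Python sketch under stated assumptions: the `sample` string is toy data shaped like the top-level view, and `summarize_mondial` is a hypothetical helper. Whether the real `mondial.adm` parses with Python's `json` module is also an assumption, based on the spec describing the files as JSON.

```python
import json

# Toy stand-in with the same shape as the top-level view; not the real data.
sample = """
{
  "mondial": {
    "country": [{"name": "Hungary"}, {"name": "Peru"}],
    "mountain": [{"name": "Toy Peak"}],
    "sea": []
  }
}
"""


def summarize_mondial(record: dict) -> dict:
    """Map each key under "mondial" to the number of elements in its array."""
    return {key: len(value) for key, value in record["mondial"].items()}


record = json.loads(sample)
summary = summarize_mondial(record)
for key, count in summary.items():
    print(f"{key}: {count} element(s)")
```

Each key under `mondial` holds an array, which is why the queries reach the countries through the path `world` → `mondial` → `country`.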
## Understanding the Example Queries
The first query returns the value stored under country.
```sqlp
USE hw5;
SELECT X.mondial.country
FROM world AS X;
```
This is the entire list of countries. Notice that the key-value pair (`"country" : [ ... ]`) is stored as an object. This holds for all other SQL++ queries, so all output instances are their own objects.
---
The next query returns the same thing as before, but now each output object is a single country.
```sqlp
USE hw5;
SELECT Y AS country
FROM world AS X, X.mondial.country AS Y;
```
Why does SQL++ do this? The idea is that when we specify `X.mondial.country AS Y` in the `FROM` clause, we have now accessed a collection (more specifically an array) of countries. The semantics of having a collection in the `FROM` clause is that SQL++ will iterate over all elements in the array.
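
The difference between the first two queries can be mimicked with plain loops. The sketch below is only an analogy in Python over toy data (the `world` list and its two countries are made up for illustration); it is not how AsterixDB evaluates queries internally.

```python
# Toy stand-in for the `world` dataset: a single record holding a nested array.
world = [
    {
        "mondial": {
            "country": [
                {"-car_code": "H", "name": "Hungary"},
                {"-car_code": "PE", "name": "Peru"},
            ]
        }
    }
]

# SELECT X.mondial.country FROM world AS X
# One output object per record in `world`; the whole array is one value.
first_query = [{"country": x["mondial"]["country"]} for x in world]

# SELECT Y AS country FROM world AS X, X.mondial.country AS Y
# The second FROM term iterates over the array, giving one output object per element.
second_query = [{"country": y} for x in world for y in x["mondial"]["country"]]

print(len(first_query), "object(s) from the first query")
print(len(second_query), "object(s) from the second query")
```

The second comprehension has a nested `for`, which plays the role of the extra term in the `FROM` clause.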
---
The third query given is a copy of the second, except that we grab specific pieces of information from each country.
```sqlp
USE hw5;
SELECT Y.`-car_code` AS code, Y.name AS name
FROM world AS X, X.mondial.country AS Y
ORDER BY Y.name;
```
This is perhaps one of the most unassuming parts of SQL++. Despite SQL++ working on semi-structured data, the array `X.mondial.country` is treated like a table in the `SELECT` clause. Like how normal SQL has the relational semantics of tables and rigid attributes, SQL++ does the same by treating collections like tables and keys like attributes.
In that same vein of SQL++ being very similar to SQL, most, if not all, of the keywords in SQL are supported via implementation (like `ORDER BY`) or are syntactic sugar (like the `count` function).
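
To make the "collections as tables, keys as attributes" idea concrete, here is a rough Python analogy of the projection and `ORDER BY` in the third query. The `countries` list is toy data standing in for `X.mondial.country`.

```python
# Toy stand-in for the array X.mondial.country.
countries = [
    {"-car_code": "PE", "name": "Peru"},
    {"-car_code": "H", "name": "Hungary"},
    {"-car_code": "F", "name": "France"},
]

# SELECT Y.`-car_code` AS code, Y.name AS name ... ORDER BY Y.name
rows = sorted(
    ({"code": y["-car_code"], "name": y["name"]} for y in countries),
    key=lambda row: row["name"],
)

for row in rows:
    print(row)
```

Each element of the array behaves like a row, and each key picked in the `SELECT` clause behaves like a column.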
---
The next query introduces the `CASE` expression.
```sqlp
USE hw5;
SELECT Z.name AS province_name, U.name AS city_name
FROM world AS X, X.mondial.country AS Y, Y.province AS Z,
(CASE WHEN is_array(Z.city)
THEN Z.city
ELSE [Z.city] END) AS U
WHERE Y.name = 'Hungary';
```
`CASE` expressions are essentially if-then-else expressions (not loops) and can go almost anywhere in your query. This is very useful when dealing with heterogeneity. When used in the `FROM` clause, the most common use of `CASE` is to coerce data into an array so that SQL++ can iterate over it (otherwise a TypeMismatchException will occur).
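
The normalization performed by the `CASE` expression can be sketched in Python as follows. The helper `as_array` and the `provinces` list are hypothetical and use toy data; the sketch only mirrors the logic of wrapping a single value in an array.

```python
def as_array(value):
    """Mirror of CASE WHEN is_array(v) THEN v ELSE [v] END."""
    return value if isinstance(value, list) else [value]


# Toy provinces: one has a single city object, the other a list of cities.
provinces = [
    {"name": "Province A", "city": {"name": "City 1"}},
    {"name": "Province B", "city": [{"name": "City 2"}, {"name": "City 3"}]},
]

# Iterating over as_array(...) works for both shapes.
rows = [
    {"province_name": z["name"], "city_name": u["name"]}
    for z in provinces
    for u in as_array(z["city"])
]

for row in rows:
    print(row)
```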
---
In the final example query, the `LET` keyword is used for the first time.
```sqlp
USE hw5;
SELECT Z.name AS province_name, (SELECT cities.name FROM cities) AS cities
FROM world AS X, X.mondial.country AS Y, Y.province AS Z
LET cities = (CASE WHEN Z.city IS MISSING
THEN []
WHEN is_array(Z.city)
THEN Z.city
ELSE [Z.city] END)
WHERE Y.name = 'Hungary';
```
Using `LET` is a simple and concise way to create virtual "tables". The reason why SQL++ included this was to deal with the difficulties of hierarchical data access as evidenced in this example where the "virtual table" was built from items already in the `FROM` clause. Note that we could not generate similar logic easily or concisely with a `FROM` or `WITH` subquery. Notice that in the example above, the `CASE` expression in the `LET` clause uses `Z` which is data derived in the `FROM` clause (something we cannot do in `FROM` or `WITH`). Note that the `cities` "virtual table" is also correlated to the specific value of `Z`, a useful property when extracting output.
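
As a rough analogy, `LET` behaves like a local variable computed once per iteration of the `FROM` clause. The Python sketch below uses toy data, with a missing dictionary key standing in for `MISSING`; the helper `cities_of` is hypothetical.

```python
def cities_of(province: dict) -> list:
    """Mirror of the three-way CASE: MISSING -> [], array -> itself, single value -> [value]."""
    if "city" not in province:  # a missing key plays the role of MISSING
        return []
    city = province["city"]
    return city if isinstance(city, list) else [city]


# Toy provinces covering all three shapes.
provinces = [
    {"name": "Province A", "city": {"name": "City 1"}},
    {"name": "Province B", "city": [{"name": "City 2"}, {"name": "City 3"}]},
    {"name": "Province C"},
]

result = []
for z in provinces:
    cities = cities_of(z)  # LET cities = ... (recomputed for each Z)
    result.append(
        {
            "province_name": z["name"],
            # (SELECT cities.name FROM cities): a nested collection per province
            "cities": [{"name": u["name"]} for u in cities],
        }
    )

for row in result:
    print(row)
```

A province without cities still appears in the output, with an empty nested collection.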
## New Keywords in SQL++
SQL++ boasts a huge increase in functionality over standard SQL. With that functionality comes new components to the language that you should become familiar with.
* \` `...` \` Escape character called backtick for accessing keys with characters like `-` or `#` in them.
* `MISSING` A reserved keyword like `NULL`. Can be used in comparisons like `x IS NOT MISSING`.
* `is_array( ... )` A function to check if the value is an array or not.
* `split( s , d )` A function to split a string `s` by the delimiter `d`. Needed for acquiring country codes that are listed together in a single string.
* `CASE WHEN ... THEN ... [WHEN ... THEN ...] ELSE ... END` An expression to output values based on `WHEN` predicates.
* `LET x = ...` Like a `WITH` subquery, but with the ability to utilize the data specified in the `FROM` clause.
## Aggregations in SQL++
Almost all of the aggregation in this homework will be counting.
* `coll_count( ... )` Counts the number of elements in an array (including `NULL` and `MISSING`)
* `array_count( ... )` Counts the number of elements in an array (not including `NULL` and `MISSING`)
* `count( ... )` This function is supported as syntactic sugar for `ARRAY_COUNT( (SELECT VALUE 1 FROM $1 AS G) )`
Note that these functions only work on array types. Usually how you can take advantage of these functions is by using `LET` to generate the array you want. For example if we wanted to find the total number of member countries in each organization:
```sqlp
USE hw5;
SELECT T.id, sum(T.num_members) AS total_members
FROM (SELECT Y.`-id` AS id, coll_count(arr) AS num_members
FROM world AS X, X.mondial.organization AS Y,
(CASE WHEN is_array(Y.members)
THEN Y.members
ELSE [Y.members] END) AS Z
LET arr = split(Z.`-country`, ' ')) AS T
GROUP BY T.id;
```
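
The logic of the query above (normalize `members` to an array, split each `-country` string, count, then sum per organization) can be traced in Python on toy data. The organization ids and country strings below are made up, and `as_array` is a hypothetical helper.

```python
def as_array(value):
    """Mirror of CASE WHEN is_array(v) THEN v ELSE [v] END."""
    return value if isinstance(value, list) else [value]


# Toy organizations: `members` is an array in one and a single object in the other.
organizations = [
    {
        "-id": "org-A",
        "members": [
            {"-type": "member", "-country": "F D I"},
            {"-type": "observer", "-country": "H"},
        ],
    },
    {"-id": "org-B", "members": {"-type": "member", "-country": "PE"}},
]

total_members = {}
for y in organizations:
    for z in as_array(y["members"]):
        arr = z["-country"].split(" ")  # LET arr = split(Z.`-country`, ' ')
        num_members = len(arr)          # counting the elements of arr
        # GROUP BY T.id with sum(T.num_members)
        total_members[y["-id"]] = total_members.get(y["-id"], 0) + num_members

print(total_members)
```

For this toy input, `org-A` totals 4 (three codes plus one) and `org-B` totals 1.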
## Notes on how to formulate your solutions
We do not have a style guide for SQL++ other than following the conventions of normal SQL. Make sure you put line breaks before most of your keywords (as appropriate). We will be lax with syntax for this HW. Above all, just make your code readable.
Formatting your answers can be hard, and grabbing exactly the data you want is not always concise. Don't be afraid to use subqueries wherever you need them. In fact, correlated subqueries, if used correctly in this homework, will actually make some queries run extremely fast!
---
Code base from Dan Suciu and Alvin Cheung
Written by Jonathan Leang
put your .sqlp files in this directory, one file per question.
#!/bin/bash
hw=hw5
#check no uncommitted changes.
(git status | grep -q modified:) && echo 'Error. There are uncommitted changes in your working directory. You can do "git status" to see them.
Please commit or stash uncommitted changes before submitting' && exit 1
COMMIT=$(git log | head -n 1 | cut -b 1-14)
if (git tag $hw 2>/dev/null)
then
echo "Created tag '$hw' pointing to $COMMIT"
else
git tag -d $hw && git tag $hw
echo "Re-creating tag '$hw'... (now $COMMIT)"
fi
echo "Now syncing with origin..."
git push origin --mirror #--atomic
echo "Please verify in gitlab that your tag '$hw' matches what you expect. "