Changes

Ellis Michael · 464359f7
--- a/Lab-1.md
+++ b/Lab-1.md
 # MapReduce
 **Due: Wed Jan 13, 9:00pm**
-*Modified from the [MIT 6.824 Labs](http://css.csail.mit.edu/6.824/2014/labs/lab-1.html)*
+*Modified from the [MIT 6.824 Labs][mit-labs]*
 ## Introduction
-In this lab you'll build a MapReduce library as a way to learn the Go programming language and as a way to learn about fault tolerance in distributed systems. In the first part you will write a simple MapReduce program. In the second part you will write a Master that hands out jobs to workers, and handles failures of workers. The interface to the library and the approach to fault tolerance is similar to the one described in the original [MapReduce paper](http://research.google.com/archive/mapreduce-osdi04.pdf).
+In this lab you'll build a MapReduce library as a way to learn the Go
+programming language and as a way to learn about fault tolerance in distributed
+systems. In the first part you will write a simple MapReduce program.  In the
+second part you will write a Master that hands out jobs to workers, and handles
+failures of workers.  The interface to the library and the approach to fault
+tolerance is similar to the one described in the original
+[MapReduce paper][mapreduce-pdf].
 ### Collaboration Policy
-You must write all the code you hand in for 452, except for code that we give you as part of the assignment. You are not allowed to look at anyone else's solution, and you are not allowed to look at code from previous years. You may discuss the assignments with other students, but you may not look at or copy each others' code. Please do not publish your code or make it available to future 452 students -- for example, please do not make your code public on github.
-Undergrads taking 452 may do the labs with a partner. Masters students should complete the labs individually.
+You must write all the code you hand in for 452, except for code that we give
+you as part of the assignment. You are not allowed to look at anyone else's
+solution, and you are not allowed to look at code from previous years. You may
+discuss the assignments with other students, but you may not look at or copy
+each others' code. Please do not publish your code or make it available to
+future 452 students -- for example, please do not make your code public on
+GitHub.
+Undergrads taking 452 may do the labs with a partner. Masters students should
+complete the labs individually.
 ### Software
-You'll implement this lab (and all the labs) in [Go 1.5](http://www.golang.org/) (the last released version). This version is available as packages for Linux and Mac OSX through [MacPorts](https://www.macports.org/). You can also download binaries from the Go web site.
-The Go web site contains lots of tutorial information which you may want to look at. We supply you with a non-distributed MapReduce implementation, and a partial implementation of a distributed implementation (just the boring bits).
+You'll implement this lab (and all the labs) in [Go >=1.5][go]. This version is
+available as packages for Linux and Mac OSX through [MacPorts][macports]. You
+can also download binaries from the Go web site.
-You'll fetch the initial lab software with git (a version control system). To learn more about git, take a look at the [git user's manual](http://www.kernel.org/pub/software/scm/git/docs/user-manual.html), or, if you are already familiar with other version control systems, you may find this [CS-oriented overview of git](http://eagain.net/articles/git-for-computer-scientists/) useful. 
+The Go web site contains lots of tutorial information which you may want to look
+at. We supply you with a non-distributed MapReduce implementation, and a partial
+implementation of a distributed implementation (just the boring bits).
-We'll be using our new department git server, which is hosted at `gitlab.cs.washington.edu`. This is basically a local deployment of GitHub, so if you have used GitHub before, you are all set! Before downloading the software, you'll need to sign in and upload your ssh key. A handy tutorial on how to generate and upload your ssh key is available from [GitHub](https://help.github.com/articles/generating-ssh-keys/).
+You'll fetch the initial lab software with git (a version control system). To
+learn more about git, take a look at the [git user's manual][git-manual], or, if
+you are already familiar with other version control systems, you may find this
+[CS-oriented overview of git][cs-git] useful.
-```
+We'll be using our new department git server, which is hosted at
+`gitlab.cs.washington.edu`. This is basically a local deployment of GitHub, so
+if you have used GitHub before, you are all set! Before downloading the
+software, you'll need to sign in and upload your ssh key. A handy tutorial on
+how to generate and upload your ssh key is available from [GitHub][github-ssh].
+```sh
 $ git clone git@gitlab.cs.washington.edu:dwoos/452-labs.git 452-labs
 $ cd 452-labs
 $ ls
-src
+src README.md
-$ 
 ```
-Git allows you to keep track of the changes you make to the code. For example, if you want to checkpoint your progress, you can commit your changes by running:
+Git allows you to keep track of the changes you make to the code. For example,
+if you want to checkpoint your progress, you can commit your changes by running:
-```
+```sh
 $ git commit -am 'partial solution to lab 1'
-$ 
 ```
-If you would like to host your code on gitlab as well, the easiest way is to fork the 452-labs repo and make your edits to your fork. [Instructions on how to fork a repo.](https://help.github.com/articles/fork-a-repo/) If you do fork the repo, do not forget to keep your fork synced! We may be adding hints to the code throughout the quarter.
+If you would like to host your code on gitlab as well, the easiest way is to
+fork the 452-labs repo and make your edits to your fork. [Instructions on how to
+fork a repo.][fork-rep] If you do fork the repo, do not forget to keep your fork
+synced! We may be adding hints to the code throughout the quarter. Also, be sure
+to make your repo private.
 ### Getting started
-There is an input file `kjv12.txt` in `~/452-labs/src/main`, which was downloaded from [here](https://web.archive.org/web/20130530223318/http://patriot.net/~bmcgin/kjv12.txt). Compile the initial software we provide you and run it with the downloaded input file:
+There is an input file `kjv12.txt` in `~/452-labs/src/main`, which was
+downloaded from [here][kjv]. Compile the initial software we provide you and run
+it with the downloaded input file:
 ```sh
-$ export GOPATH=$HOME/452-labs
+$ export GOPATH=$HOME/452-labs    # Or, the location of the repo's root directory
 $ cd ~/452-labs/src/main
 $ go run wc.go master kjv12.txt sequential
 # command-line-arguments
@@ -49,12 +85,16 @@ $ go run wc.go master kjv12.txt sequential
 ./wc.go:15: missing return at end of function
 ```
-The compiler produces two errors, because the implementation of the `Map` and `Reduce` functions is incomplete.
+The compiler produces two errors, because the implementation of the `Map` and
+`Reduce` functions is incomplete.
 ## Part I: Word count
-Modify `Map` and `Reduce` so that `wc.go` reports the number of occurrences of each word in alphabetical order.
-```
+Modify `Map` and `Reduce` so that `wc.go` reports the number of occurrences of
+each word in alphabetical order.
+```sh
 $ go run wc.go master kjv12.txt sequential
 Split kjv12.txt
 DoMap: read split mrtmp.kjv12.txt-0 966954
@@ -82,9 +122,9 @@ Merge: read mrtmp.kjv12.txt-res-1
 Merge: read mrtmp.kjv12.txt-res-2
 ```
-The output will be in the file "mrtmp.kjv12.txt". Your implementation is correct if the following command produces the following top 10 words:
+The output will be in the file "mrtmp.kjv12.txt". Your implementation is correct
+if the following command produces the following top 10 words:
-```
+```sh
 $ sort -n -k2 mrtmp.kjv12.txt | tail -10
 unto: 8940
 he: 9666
@@ -97,62 +137,123 @@ of: 34434
 and: 38850
 the: 62075
 ```
-To make testing easy for you, run:
-```
+To make testing easy for you, run:
+```sh
 $ ./test-wc.sh
 ```
 and it will report if your solution is correct or not.
-Before you start coding read Section 2 of the [MapReduce paper](http://research.google.com/archive/mapreduce-osdi04.pdf) and our code for mapreduce, which is in `mapreduce.go` in package `mapreduce`. In particular, you want to read the code of the function `RunSingle` and the functions it calls. This well help you to understand what MapReduce does and to learn Go by example.
+Before you start coding read Section 2 of the [MapReduce paper][mapreduce-pdf].
+Your `Map()` and `Reduce()` functions will differ a bit from those in the
+paper's Section 2.1. Your `Map()` will be passed some of the text from the file;
+it should split it into words, and return a `list.List` of key/value pairs, of
+type` mapreduce.KeyValue`. Your `Reduce()` will be called once for each key,
+with a list of all the values generated by `Map()` for that key; it should
+return a single output value.
+It will help to read our code for mapreduce, which is in `mapreduce.go` in
+package mapreduce. Look at `RunSingle()` and the functions it calls. This well
+help you to understand what MapReduce does and to learn Go by example.
 Once you understand this code, implement `Map` and `Reduce` in `wc.go`.
-**Hint:** you can use [strings.FieldsFunc](http://golang.org/pkg/strings/#FieldsFunc) to split a string into components. The following lambda expression will be useful for checking whether a character is a letter: 
+**Hint:** you can use [strings.FieldsFunc][fields-func] to split a string into
+components. The following lambda expression will be useful for checking whether
+a character is a letter:
 ```go
 func(r rune) bool { return !unicode.IsLetter(r) }
 ```
-**Hint:** for the purposes of this exercise, you can consider a word to be any contiguous sequence of letters, as determined by [unicode.IsLetter](http://golang.org/pkg/unicode/#IsLetter). A good read on what strings are in Go is the [Go Blog on strings](http://blog.golang.org/strings).
+**Hint:** for the purposes of this exercise, you can consider a word to be any
+contiguous sequence of letters, as determined by [unicode.IsLetter][isletter]. A
+good read on what strings are in Go is the [Go Blog on strings][strings].
-**Hint:** The [strconv package](http://golang.org/pkg/strconv/) is handy to convert strings to integers etc.
+**Hint:** The [strconv package][strconv] is handy to convert strings to integers
+etc.
-**Hint:** The solution to this part of the lab should take you about 10 lines of code. If your code is much >100 LoC, you might want to discuss your design with a TA.
+**Hint:** The solution to this part of the lab should take you about 10 lines of
+code. If your code is much >100 LoC, you might want to discuss your design with
+a TA.
 You can remove the output file and all intermediate files with:
-```
+```sh
 $ rm mrtmp.*
 ```
-## Part II: Distributing MapReduce jobs
-In this part you will design and implement a master who distributes jobs to a set of workers. We give you the code for the RPC messages (see `common.go` in the `mapreduce` package) and the code for a worker (see `worker.go` in the `mapreduce` package).
-Your job is to complete `master.go` in the `mapreduce` package. In particular, the `RunMaster()` function in `master.go` should return only when all of the map and reduce tasks have been executed. This function will be invoked from the `Run()` function in `mapreduce.go`.
+## Part II: Distributing MapReduce jobs
-The code in `mapreduce.go` already implements the `MapReduce.Register` RPC function for you, and passes the new worker's information to `mr.registerChannel`. You should process new worker registrations by reading from this channel.
-Information about the MapReduce job is in the `MapReduce` struct, defined in `mapreduce.go`. Modify the `MapReduce` struct to keep track of any additional state (e.g., the set of available workers), and initialize this additional state in the `InitMapReduce()` function. The master does not need to know which `Map` or `Reduce` functions are being used for the job; the workers will take care of executing the right code for `Map` or `Reduce`.
-In Part II, you don't have worry about failures of workers. You are done with Part II when your implementation passes the first test set in `test_test.go` in the `mapreduce` package.
-`test_test.go` uses Go's unit testing. From now on all exercises (including subsequent labs) will use it, but you can always run the actual programs from the main directory. You run unit tests in a package directory as follows:
+In this part you will complete a version of MapReduce that splits the work up
+over a set of worker threads, in order to exploit multiple cores. A master
+thread hands out work to the workers and waits for them to finish. The master
+should communicate with the workers via RPC. We give you the worker code
+(`mapreduce/worker.go`), the code that starts the workers, and code to deal with
+RPC messages (`mapreduce/common.go`).
+Your job is to complete `master.go` in the mapreduce package. In particular, you
+should modify `RunMaster()` in `master.go` to hand out the map and reduce jobs to
+workers, and return only when all the jobs have finished.
+Look at `Run()` in `mapreduce.go`. It calls `Split()` to split the input into
+per-map-job files, then calls your `RunMaster()` to run the map and reduce jobs,
+then calls `Merge()` to assemble the per-reduce-job outputs into a single output
+file. `RunMaster` only needs to tell the workers the name of the original input
+file (`mr.file`) and the job number; each worker knows from which files to read
+its input and to which files to write its output.
+Each worker sends a Register RPC to the master when it starts. `mapreduce.go`
+already implements the master's `MapReduce.Register` RPC handler for you, and
+passes the new worker's information to `mr.registerChannel`. Your RunMaster
+should process new worker registrations by reading from this channel.
+Information about the MapReduce job is in the `MapReduce` struct, defined in
+`mapreduce.go`. Modify the `MapReduce` struct to keep track of any additional
+state (e.g., the set of available workers), and initialize this additional state
+in the `InitMapReduce()` function. The master does not need to know which Map or
+Reduce functions are being used for the job; the workers will take care of
+executing the right code for Map or Reduce.
+You should run your code using Go's unit test system. We supply you with a set
+of tests in `test_test.go`. You run unit tests in a package directory (e.g., the
+mapreduce directory) as follows:
-```
+```sh
+$ cd mapreduce 
 $ go test
 ```
-The master should send RPCs to the workers in parallel so that the workers can work on jobs concurrently. You will find the `go` statement useful for this purpose and the [Go RPC documentation](http://golang.org/pkg/net/rpc/).
+You are done with Part II when your implementation passes the first test (the
+"Basic mapreduce" test) in `test_test.go` in the mapreduce package. You don't
-The master may have to wait for a worker to finish before it can hand out more jobs. You may find channels useful to synchronize threads that are waiting for reply with the master once the reply arrives. Channels are explained in the document on [Concurrency in Go](http://golang.org/doc/effective_go.html#concurrency).
+yet have worry about failures of workers.
-We've given you code that sends RPCs via "UNIX-domain sockets". This means that RPCs only work between processes on the same machine. It would be easy to convert the code to use TCP/IP-based RPC instead, so that it would communicate between machines; you'd have to change the first argument to calls to Listen() and Dial() to "tcp" instead of "unix", and the second argument to a port number like ":5100". You will need a shared distributed file system.
+The master should send RPCs to the workers in parallel so that the workers can
+work on jobs concurrently. You will find the go statement useful for this
-The easiest way to track down bugs is to insert `log.Printf()` statements, collect the output in a file with `go test > out`, and then think about whether the output matches your understanding of how your code should behave. The last step is the most important.
+purpose and the Go RPC documentation.
-You will see some error messages that are safe to ignore. These will looks something like this:
+The master may have to wait for a worker to finish before it can hand out more
-```
+jobs. You may find channels useful to synchronize threads that are waiting for
+reply with the master once the reply arrives. Channels are explained in the
+document on Concurrency in Go.
+The easiest way to track down bugs is to insert `log.Printf()` statements,
+collect the output in a file with `go test > out`, and then think about whether
+the output matches your understanding of how your code should behave. The last
+step (thinking) is the most important.
+The code we give you runs the workers as threads within a single UNIX process,
+and can exploit multiple cores on a single machine. Some modifications would be
+needed in order to run the workers on multiple machines communicating over a
+network. The RPCs would have to use TCP rather than UNIX-domain sockets; there
+would need to be a way to start worker processes on all the machines; and all
+the machines would have to share storage through some kind of network file
+system.
+You will see some error messages that are safe to ignore. These will looks
+something like this:
+```sh
 2016/01/04 11:44:52 method CleanupFiles has wrong number of ins: 1
 2016/01/04 11:44:52 method CleanupRegistration has wrong number of ins: 1
 2016/01/04 11:44:52 method KillWorkers has wrong number of ins: 1
@@ -163,31 +264,76 @@ You will see some error messages that are safe to ignore. These will looks somet
 2016/01/04 11:44:52 method Split has wrong number of ins: 2
 2016/01/04 11:44:52 method StartRegistrationServer has wrong number of ins: 1
 ```
-The important thing to look for is a `PASS` at the end of your output indicating that your implementation has passed all of the unit tests.
+The important thing to look for is a `PASS` at the end of your output indicating
+that your implementation has passed all of the unit tests.
+**Hint:** Use a `select` to check for new worker registrations as well as
+checking for finished workers that need new jobs.
-**Hint:** Use a `select` to check for new worker registrations as well as checking for finished workers that need new jobs.
+**Hint:** Think about how you might handle jobs that do not finish and have to
+be restarted. This will limit how much re-design you will have to do in the next
+section when you have to handle worker failures.
-**Hint:** When designing, think about how you might handle jobs that do not finish and have to be restarted. This will limit how much re-design you will have to do in the next section when you have to handle worker failures.
 ## Part III: Handling worker failures
-In this part you will make the master handle workers failures. computing a job. In MapReduce handling failures of workers is relatively straightforward, because the workers don't have persistent state. If the worker fails, any RPCs that the master issued to that worker will fail (e.g., due to a timeout). Thus, if the master's RPC to the worker fails, the master should re-assign the job given to the failed worker to another worker.
+In this part you will make the master handle failed workers. MapReduce makes
+this relatively easy because workers don't have persistent state. If a worker
+fails, any RPCs that the master issued to that worker will fail (e.g., due to a
+timeout). Thus, if the master's RPC to the worker fails, the master should
+re-assign the job given to the failed worker to another worker.
-An RPC failure doesn't necessarily mean that the worker failed; the worker may just be unreachable but still computing. Thus, it may happen that two workers receive the same job and compute it. However, because jobs are idempotent, it doesn't matter if the same job is computed twice---both times it will generate the same output. So, you don't have to anything special for this case. (Our tests never fail workers in the middle of job, so you don't even have to worry about several workers writing to the same output file.)
+An RPC failure doesn't necessarily mean that the worker failed; the worker may
+just be unreachable but still computing. Thus, it may happen that two workers
+receive the same job and compute it. However, because jobs are idempotent, it
+doesn't matter if the same job is computed twice---both times it will generate
+the same output. So, you don't have to anything special for this case. (Our
+tests never fail workers in the middle of job, so you don't even have to worry
+about several workers writing to the same output file.)
-You don't have to handle failures of the master; we will assume it won't fail. Making the master fault-tolerant is more difficult because it keeps persistent state that must be replicated to make the master fault tolerant. Keeping replicated state consistent in the presence of failures is challenging. Much of the later labs are devoted to this challenge.
+You don't have to handle failures of the master; we will assume it won't fail.
+Making the master fault-tolerant is more difficult because it keeps persistent
+state that would have to be recovered in order to resume operations after a
+master failure. Much of the later labs are devoted to this challenge.
-Your implementation must pass the two remaining test cases in `test_test.go`. The first case tests the failure of one worker. The second test case tests handling of many failures of workers. Periodically, the test cases starts new workers that the master can use to make forward progress, but these workers fail after handling a few jobs. 
+Your implementation must pass the two remaining test cases in test_test.go. The
+first case tests the failure of one worker. The second test case tests handling
+of many failures of workers. Periodically, the test cases start new workers that
+the master can use to make forward progress, but these workers fail after
+handling a few jobs.
+**Hint:** The solution to Parts II and III of the lab should be around 60 lines
+of code.
-**Hint:** The solution to Parts II and III of the lab should be around 60 lines of code.
 ## Submission Instructions
 Make sure that you have done the following:
 * COMMENT your code! We should be able to understand what your code is doing.
-* Make sure all of your code passes the test cases. Do not modify them, as we will be replacing all `test_test.go` files before grading.
+* Make sure all of your code passes the test cases. Do not modify them, as we
-* Add a README.lab1 in 452-labs/src with:
+  will be replacing all `test_test.go` files before grading.
+* Add a `README.lab1` in `452-labs/src` with:
    * Your name
    * Your partner's name (if you had one)
    * How many hours you each spent on the lab.
    * A high-level description of your design.
-* See [Submission Requirements](submission) for specific format requirements for the file you will upload to the Catalyst dropbox 
+* See [Submission Requirements][submission] for specific format requirements for
\ No newline at end of file
+  the file you will upload to the Catalyst dropbox
+[mit-labs]: http://css.csail.mit.edu/6.824/2015/labs/lab-1.html
+[mapreduce-pdf]: http://research.google.com/archive/mapreduce-osdi04.pdf
+[go]: http://www.golang.org
+[macports]: https://www.macports.org/
+[git-manual]: http://www.kernel.org/pub/software/scm/git/docs/user-manual.html
+[cs-git]: http://eagain.net/articles/git-for-computer-scientists/
+[github-ssh]: https://help.github.com/articles/generating-ssh-keys/
+[fork-repo]: https://help.github.com/articles/fork-a-repo/
+[kjv]: https://web.archive.org/web/20130530223318/http://patriot.net/~bmcgin/kjv12.txt
+[fields-func]: http://golang.org/pkg/strings/#FieldsFunc
+[isletter]: http://golang.org/pkg/unicode/#IsLetter
+[strings]: http://blog.golang.org/strings
+[strconv]: http://golang.org/pkg/strconv/
+[rpc]: http://golang.org/pkg/net/rpc/
+[submission]: submission
+[concurrency]: http://golang.org/doc/effective_go.html#concurrency