Winston Jodjana winj@uw.edu
Dixon Tirtayadi dixontir@uw.edu
Mengqi Chen franchen@uw.edu
Write-up
- A detailed description of your group's implementation
client.py - The client file to send lock or unlock operations to a paxos node
paxos.py - The main node file that executes multi-instance Paxos
paxos_utils.py - Misc. utilities used by paxos.py (e.g. ballot num)
message.py - All the messages that can be exchanged between clients and Paxos nodes, and among Paxos nodes (a rough sketch of these message types appears after this file list):
P1A - Prepare
P1B - Promise
P2A - Propose
P2B - Accept
timers.py - All the timers that track the state of Paxos nodes, e.g. to detect that a node is dead.
Also includes timeout timers for resending messages that are presumed lost.
lockmanager.py - File that handles all the lock and unlock operations specific to a client
main.py - A utility file to spawn a paxos group with multiple nodes
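As a rough illustration of the message types above, a minimal sketch in Python follows; the field names are assumptions and may differ from the actual message.py.

# Sketch of the Paxos message types (field names are hypothetical, not the actual code).
from dataclasses import dataclass
from typing import Any, Optional, Tuple

Ballot = Tuple[int, int]  # (round number, node id), compared lexicographically

@dataclass
class P1A:  # Prepare: the proposer asks acceptors to promise a ballot
    ballot: Ballot

@dataclass
class P1B:  # Promise: the acceptor reports its highest accepted ballot/value, if any
    ballot: Ballot
    accepted_ballot: Optional[Ballot] = None
    accepted_value: Any = None

@dataclass
class P2A:  # Propose: the proposer asks acceptors to accept a value in a slot
    ballot: Ballot
    slot: int
    value: Any

@dataclass
class P2B:  # Accept: the acceptor acknowledges that it accepted the value
    ballot: Ballot
    slot: int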
Some design decisions / details:
Our implementation has a single node (paxos.py) serve all three roles: proposer, acceptor, and learner.
Any node can accept a request from a client and participate in Paxos. Each node runs separately
(for example, by running the main function of paxos.py in its own terminal window), so a node
can easily be killed to simulate node failure.
A client sends a request to any of the servers; the server then passes it along to the leader, which acts as the distinguished proposer and learner.
This ensures liveness, as described in the Paxos paper. We chose the first leader to be the first node in the Paxos group. Subsequent leaders are chosen
through a leader election: when a node detects that the leader has failed, it sends an election request to the other nodes, with exponential backoff.
We decided that leader election is done using the first phase of Paxos. When a leader receives a prepare request, it gives up its leadership;
this can happen if another node thought the leader was dead when it was not. All other nodes that receive the prepare request also make the sender their leader.
One check is needed: if the prepare request carries a ballot lower than ours, we do not give up the leader position (see the sketch below).
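A minimal sketch of this prepare-handling rule, assuming a node with a ballot number, a leader field, and an is_leader flag (names are illustrative, not the actual code):

# Sketch of how a node reacts to a prepare (P1A) during leader election.
from dataclasses import dataclass
from typing import Optional, Tuple

Ballot = Tuple[int, int]  # (round, node_id), compared lexicographically

@dataclass
class Node:
    node_id: int
    ballot_num: Ballot = (0, 0)
    leader: Optional[int] = None
    is_leader: bool = False

    def on_prepare(self, ballot: Ballot, sender: int) -> bool:
        """Handle a P1A (prepare) sent as part of leader election."""
        if ballot < self.ballot_num:
            return False                    # stale ballot: keep the current leader
        self.ballot_num = ballot
        if self.is_leader and sender != self.node_id:
            self.is_leader = False          # give up the leader spot
        self.leader = sender                # adopt the sender as the new leader
        return True                         # caller would reply with a P1B promise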
A failed leader is detected through heartbeats: if a node misses 5 heartbeats in a row from the leader, it assumes the leader is dead (see the sketch below).
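A simplified sketch of this failure detector; the threshold of 5 comes from the description above, while the class and method names are illustrative:

# Sketch of the missed-heartbeat rule (not the actual timers.py code).
MAX_MISSED_HEARTBEATS = 5

class HeartbeatMonitor:
    """Counts consecutive heartbeat intervals without a heartbeat from the leader."""

    def __init__(self):
        self.missed = 0

    def on_heartbeat(self):
        self.missed = 0                     # leader is alive, reset the counter

    def on_interval_elapsed(self):
        """Called once per heartbeat interval in which no heartbeat arrived."""
        self.missed += 1
        return self.missed >= MAX_MISSED_HEARTBEATS   # True => assume the leader is dead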
The client also has a timer for resending a message, covering the scenario where the node it randomly picked turns out to be dead, among other scenarios.
When the leader receives the request, it runs normal two-phase Paxos with all the other nodes. Once the value is accepted by a majority, it is put into a log.
This log contains only values that have already been chosen, together with the slot each one occupies, to guarantee the order of operations. The log is sent with the heartbeat to every other node,
so all nodes keep their logs consistent and learn the chosen values. This also means that when a node dies and recovers, it learns all the previously chosen values (see the log-merge sketch below).
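A minimal sketch of how a node could merge the chosen-value log carried by the leader's heartbeat (slot number -> chosen operation); this is an illustration, not the actual code:

def merge_chosen_log(local_log: dict, heartbeat_log: dict) -> dict:
    """Adopt every slot that the leader reports as chosen.

    Chosen values never change, so copying them is safe. A node that was dead
    and has recovered catches up simply by processing the next heartbeat.
    """
    for slot, value in heartbeat_log.items():
        local_log.setdefault(slot, value)
    return local_log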
The lock service takes the form of a LockManager, which handles only lock(x) and unlock(x) operations, where x is an int for simplicity. It is a map whose keys are the locked values
and whose values are the clients holding the locks (see the sketch below).
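A minimal sketch of the LockManager's state and operations (method signatures are illustrative, not the actual lockmanager.py):

class LockManager:
    """Maps each locked value to the client that holds the lock."""

    def __init__(self):
        self.locks = {}                     # e.g. {1: ('0.0.0.0', 53889)}

    def lock(self, x: int, client) -> bool:
        holder = self.locks.get(x)
        if holder is None or holder == client:
            self.locks[x] = client          # re-acquiring your own lock is allowed
            return True
        return False                        # some other client holds the lock

    def unlock(self, x: int, client) -> bool:
        if self.locks.get(x) == client:
            del self.locks[x]               # a single unlock releases the lock
            return True
        return False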
- Any assumptions you made
Any subsequent request from a client is treated as a different operation (no at-most-once semantics on the server side)
The identities of the Paxos group members are hardcoded
No deadlocks
- How to compile, deploy and run your implementation
Config file to indicate the addresses of the paxos group: config.json
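For illustration only, a config.json listing the group's addresses could look like the following; the exact field names are an assumption, since the real format is defined by main.py:

{
    "nodes": [
        {"host": "0.0.0.0", "port": 9000},
        {"host": "0.0.0.0", "port": 9001},
        {"host": "0.0.0.0", "port": 9002}
    ]
}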
Spawn the paxos group based on the config file using:
python3 main.py
Note: you can use the commands kill [node] or run [node] to kill and bring back up Paxos nodes. Node indices start from 0
e.g. python3 main.py
kill 0
kill 1
run 0
run 1
...
exit
Also, for every command executed, the LockManager will print the state of the locks so you can see the current locking state
and check that it is consistent. For example:
"Current Locks: {2: ('0.0.0.0', 36086), 1: ('0.0.0.0', 53889)}" means that there are two locks currently acquired:
the lock on value 2 is held by the client at ('0.0.0.0', 36086), and
the lock on value 1 is held by the client at ('0.0.0.0', 53889)
Start the client and send commands to a random paxos server based on the config file using:
python3 client.py
e.g. python3 client.py
lock 1
lock 2
unlock 1
...
exit
Note that a lock can be acquired twice by the same client: sending lock 1 followed by another lock 1 is fine, and only a single unlock 1 is needed to release it.
- Any issues or scenarios explicitly not considered
At-most-once semantics are not implemented on the server side; it is possible for a client to resend a command and have the server execute it twice.
- Any other behavior of the system you feel important to discuss (e.g. quirks, interesting behavior, performance characteristics, etc)
We intentionally left in some important print statements (e.g. when a leader failure is detected) so that the tester can see what is happening.
Bonus Implementation:
- Message resending: timeouts are used to resend messages that may have been lost.
- Nodes can be killed and brought back up.