Original PDF Flash format 4-shared-memory  


4 Shared Memory

Security and Fault-tolerance in Distributed Systems
ETHZ, Spring 2009
Christian Cachin, IBM Zurich Research Laboratory
www.zurich.ibm.com/˜cca/
4
Shared Memory
4.1
Registers
Motivation.
Registers or read/write registers are a simple and useful abstraction for shared
data storage. In the so-called shared-memory model, processes access concurrent data objects
asynchronously. Wait-free implementations of such objects guarantee that any process can
complete any operation in a finite number of steps, regardless of the execution speeds of the
other processes.
Registers may be used for communication and process synchronization, but because of
their limited operations, objects with richer and more powerful operations have also been
considered, like (binary) test-and-set operations or (multi-valued) read-modify-write opera-
tions [HS08].
In practice, the storage may take the form of shared memory (RAM) in a multiprocessor
system or storage devices (disks) connected to clients over a network.
Definitions.
Registers were formalized by Lamport [Lam86].
Definition 1 (Register). A register x is characterized by two operations:
write(x, v) → ok: writes a value v to register x and returns the symbol ok;
read(x) → v: reads the register x and returns its value v.
W.l.o.g. we consider only one register and assume that every process executes at any time
only one operation. An operation is invoked at some point in time and returns at a later point
in time. When a write operation with value v returns ok, we say that it writes v.
The sequential specification of a register requires that each read operation returns the value
written by the most recent preceding write operation.
Definition 2 (Precedence). For two operations o1 and o2, we say that
• o1 precedes o2 whenever o1 returns before o2 is invoked (they are sequential), and
• o1 is concurrent with o2 when neither operation precedes the other one.
Many variations of registers are considered:
Domain: binary and multi-valued;
Concurrent access: single-reader single-writer (SRSW), multiple-readers single-writer (MRSW),
and multiple-readers multiple-writers (MRMW);
Semantics: safe, regular, and atomic (see next).
1

Semantics.
The most important aspect of a register is its behavior under concurrent access.
W.l.o.g. assume there is an initial write operation.
Safe: A register is safe when every read not concurrent with a write returns the most recently
written value. Reads that are concurrent with at least one write may return any value in
the domain.
Regular: A register is regular if it is safe and any read concurrent with a write returns either
the most recently written value or a concurrently written value.
Atomic: A register is atomic whenever the read and write operations are linearizable, which
means that there exists an equivalent totally ordered sequential execution of them. In
other words, there exists a permutation π of all invocations and responses in the execution
such that the sequential specification of every register holds and such that for any two
operations o1 and o2 where o1 precedes o2 in the execution, o1 also precedes o2 in π.
(For one writer only, a simpler definition is to require that the register is regular and
ensures that if an operation r1 returns a value written by w1, an operation r2 returns a
value written by w2, and r1 precedes r2, then w2 does not precede w1.)
4.2
Reductions among Registers
Most variations of registers are equally powerful in the sense that they can be reduced to each
other in the shared-memory model. In particular, a multi-valued MRMW atomic register can
be implemented with binary SRSW safe registers [HS08].
Algorithm 3 (Emulation of a multi-valued MRSW register from binary MRSW registers).
Given k binary registers b0, b1, . . . , bk−1 with operations read and write, simulate a multi-valued
register B with operations READ and WRITE. The domain of B is {0, . . . , k − 1}. The emula-
tion uses a unary encoding. Initially, all binary registers are 0, except for b0 = 1.
WRITE(B, v):
READ(B):
for i = 0, 1, . . . , k − 1 do
for i = 0, 1, . . . , k − 1 do
if i = v then
if read(bi) = 1 then
write(bi, 1)
return i
else
return 0
write(bi, 0)
return ok
Theorem 4. Algorithm 3 implements a multi-valued MRSW register with safe semantics, but
not regular semantics.
To provide regular semantics, we change only the WRITE algorithm so that the register contains
either the previous or the concurrently written value at any time:
WRITE(B, v):
write(bv, 1)
for i = v − 1, v − 2, . . . , 0 do
write(bi, 0)
return ok
2

Theorem 5. Algorithm 3 with the modified WRITE implements a multi-valued MRSW regular
register from binary MRSW regular registers, and a multi-valued MRSW atomic register from
binary MRSW atomic registers.
4.3
Fault-tolerant Distributed Implementations
Here we consider a fault-tolerant distributed implementation of a register by n storage servers
P = {P1, . . . Pn} in an asynchronous network. Some servers may fail by crashing silently.
A protocol emulates the shared data object despite the failure of some servers. The data is
read and written by clients through sending messages to the servers over an asynchronous
network that provides a reliable point-to-point FIFO channel between every client and every
server. The servers do not communicate with each other. For tolerating faults, the data of the
register is stored collectively by all servers, using replication or erasure coding. We assume that
clients do not fail. Wait-free termination here means that a client completes every operation
independently from server failures and independently of the speed of other clients.
Let Q be a quorum system on P. The next algorithm implements an abstract fault-tolerant
data storage system without a centralized controller. It tolerates the failure of a set P \ Q of
servers for any Q ∈ Q. The clients need only talk to a quorum of servers for accessing the
data.
Algorithm 6 (Distributed implementation of a MRSW regular register). A register x stores
a value v; clients are a single reader and a multiple writers. The writer maintains a timestamp τ .
Every server Pi stores local copies vi and τi.
• For writing v to x, the writer increments τ , and executes the following steps: it picks
a quorum Q, sends (write, v, τ ) to all Pi ∈ Q, and waits for an ack from all servers
in Q; if not enough ack messages arrive in time, the writer repeats those steps with
a different quorum until ack messages arrive from all servers in the quorum. Then it
outputs ok and terminates the write.
Upon receiving (write, v, τ ), a server Pi sets (vi, τi) ← (v, τ ) and returns an ack
message to the client.
• To read from x, a reader executes the following steps: it picks a quorum Q, sends a
(read) message to all Pi ∈ Q, and waits for a message (value, vi, τi) from all servers
in Q; if not enough value messages arrive in time, the reader repeats those steps with
a different quorum until value messages arrive from all servers in the quorum. Then it
selects the value from the message with the highest timestamp and outputs that value.
Upon receiving the (read) message from a client, a server Pi returns the message
(value, vi, τi) to the client.
The message complexity of every operation is 2|Q|.
Theorem 7. Algorithm 6 is a wait-free implementation of a MRSW regular register on P.
Proof sketch. The quorum used by the reader has non-empty intersection with the quorum used
in the most recent write that precedes the read. If a write exists concurrent with some read, the
reader may also return the concurrently written value. Wait-freedom follows because there
exists a quorum of correct servers.
Algorithm 6 can be modified to emulate an SRSW atomic register: The reader additionally
3

maintains a value/timestamp pair (v, τ ). If the reader receives a value message containing a
higher timestamp than τ , it sets (v, τ ) to the value/timestamp pair from the message. Finally,
the reader outputs v. This emulates an atomic register only for a single reader. When there are
multiple clients C1, C2, . . . reading from the register, synchronizing the value/timestamp pair
between the readers requires an additional step.
The next algorithm is atomic for multiple readers; it synchronizes the reader timestamp by
causing clients to write during a read operation (one can show that this is necessary).
Algorithm 8 (Distributed implementation of a MRSW atomic register [ABND95, AW04]).
Let the writer be Cw; it stores a timestamp τ . Every Pi maintains (vi, τi).
write(x, v):
// executed by writer Cw only
τ ← τ + 1
send (write, v, τ ) to P1, . . . , Pn
wait for an (ack) message from all Pi in some quorum Q ∈ Q
return ok
read(x):
// client Cj
send (read) to P1, . . . , Pn
wait for (value, vi, τi) messages from all Pi in some quorum Q ∈ Q
let (v, τ ) be the received (vi, τi) pair with the largest τi
send (write, v, τ ) to P1, . . . , Pn
wait for an (ack) message from all Pi in some quorum Q ∈ Q
return v
upon receiving (write, v, τ ) from Cw:
// server Pi
if τ > τi then
(vi, τi) ← (v, τ )
send (ack) to Cw
upon receiving (read) from Cj:
// server Pi
send (value, vi, τi) to Cj
Theorem 9. Algorithm 8 implements a MRSW atomic register on a quorum system Q.
Proof sketch. The only problem is a write operation w concurrent with multiple reads, say, r1
and r2. In this case, observe that if r1 → v precedes r2 or w(v) precedes r2, then r2 → v
because r1 and w both write to a quorum that intersects with the quorum from which r2 obtains
its value.
References
[ABND95] H. Attiya, A. Bar-Noy, and D. Dolev, Sharing memory robustly in message-passing
systems, Journal of the ACM 42 (1995), no. 1, 124–142.
[AW04]
H. Attiya and J. Welch, Distributed computing: Fundamentals, simulations and
advanced topics, second ed., Wiley, 2004.
[HS08]
M. Herlihy and N. Shavit, The art of multiprocessor programming, Morgan Kauf-
mann, 2008.
[Lam86]
L. Lamport, On interprocess communication, Distributed Computing 1 (1986),
no. 2, 77–85, 86–101.
4