Original PDF Flash format using-speculation-to-simplify-multiprocessor-design  


Using Speculation To Simplify Multiprocessor Design

Appears in the Proceedings of the International Parallel and Distributed Processing Symposium
Santa Fe, New Mexico, April 27-29, 2004
Using Speculation to Simplify Multiprocessor Design
Daniel J. Sorin1, Milo M. K. Martin2, Mark D. Hill3, David A. Wood3
1Dept. of Elec. & Comp. Engineering
2Dept. of Comp. & Information Science
3Computer Sciences Dept.
Duke University
University of Pennsylvania
University of Wisconsin—Madison
sorin@ee.duke.edu
milom@cis.upenn.edu
{markhill,david}@cs.wisc.edu
Abstract
cache coherence protocols are prone to infrequent timing
races that exercise difficult-to-test corner cases. As another
Modern multiprocessors are complex systems that often
example, deadlock avoidance presents a challenge in the
require years to design and verify. A significant factor is
design of a multiprocessor memory system. The standard
that engineers must allocate a disproportionate share of
solution avoids deadlock in the interconnection network
their effort to ensure that rare corner-case events behave
by using virtual channels [7], which increases design and
correctly. This paper proposes using “speculation for sim-
verification complexity. However, even without virtual
plicity” to enable designers to focus on common-case sce-
channels, deadlock occurs rarely. We would like to be able
narios. Our approach is to speculate that rare events will
to speculate that rare scenarios, such as corner cases in
not occur and rely on an efficient recovery mechanism to
cache coherence protocols and deadlocks in interconnects,
undo the effects of mis-speculations.
will not occur and recover when they do.
We illustrate the potential of speculation to simplify
Designers have already discovered the potential of
multiprocessor design with three examples. First, we sim-
speculation to simplify design, but they have only applied
plify the design of a directory cache coherence protocol by
it within the processor core. The key has been to leverage
speculatively relying on point-to-point ordering of mes-
the existing mis-speculation recovery mechanism in the
sages in an adaptively routed interconnection network.
dynamically scheduled core. Several processors resort to a
Second, we simplify a snooping cache coherence protocol
pipeline squash on rare, complicated instructions, such as
by treating a rare coherence state transition as a mis-spec-
the Intel Pentium Pro’s manipulation of control registers
ulation. Third, we simplify interconnection network design
[11]. Similarly, the Pentium 4 uses recovery to handle cor-
by removing the virtual channels and then recovering from
ner case deadlocks in the scheduler [5]. These processors
deadlocks when they occur.
are speculating that certain instructions or races are rare.
Experiments with full-system simulation and commer-
The enabling technology for speculation is fast and effi-
cial workloads show that speculation is a viable approach
cient recovery; otherwise, even infrequent mis-specula-
for simplifying system design. Systems can incur as many
tions will unacceptably degrade performance. Modern
as ten recoveries per second due to mis-speculations with-
dynamically scheduled processors provide such mecha-
out significantly degrading performance, and our specula-
nisms in the uniprocessor core, allowing speculation
tively simplified designs incur far fewer recoveries.
within a single thread of execution. More recently, there
have been several proposals for fast and efficient system-
1 Introduction
wide checkpoint/recovery of multiprocessors in hardware
Shared memory multiprocessors are complicated sys-
[19, 17]. These mechanisms capture a consistent global
tems that are difficult to design. Verifying that these
state, allowing speculation between multiple processors.
designs are correct is even more difficult. As one example,
Future multiprocessor systems will likely use such system-
wide checkpoint/recovery to tolerate the increasing fre-
quency of transient hardware faults in emerging sub-
micron technologies. We seek to exploit such a mechanism
This work is supported in part by the National Science Foundation
(grants: CCR-0309164, CCR-0324878, EIA-9971256, EIA-0205286, and
to simplify the design and verification of multiprocessors.
CCR-0105721), an Intel Graduate Fellowship and a Warren Faculty
Scholarship (Sorin), a Norm Koo Graduate Fellowship and an IBM Grad-
In this paper, we propose using “speculation for sim-
uate Fellowship (Martin), two Wisconsin Romnes Fellowships (Hill and
Wood), Spanish Secretaría de Estado de Educación y Universidades (Hill
plicity” to simplify multiprocessor design. The corner-
sabbatical), and donations from Compaq Computer Corporation, Intel
stone of our philosophy is to allocate design and
Corporation, IBM, and Sun Microsystems. Profs. Hill and Wood have
significant financial interests in Sun Microsystems.
verification effort towards common-case, performance-
1

Table 1. Using the framework to characterize three speculative designs
Applications of Speculation for Simplicity
Simplify directory protocol
Simplify snooping protocol
Simplify interconnection network
by speculating on point-to-point
by treating corner case transition as by removing virtual channel flow control
ordering (Section
3.1)
error (Section
3.2)
(Section
4)
(1) Infre-
re-orderings are rare and most re-
writebacks do not often race with
worst-case buffering requirements are
quency of mis- orderings do not matter
requests to write the block
rarely needed in practice
speculation
(2) Detection
one specific invalid transition in
one specific invalid transition in
timeout on cache coherence
protocol controller
protocol controller
transaction
(3) Recovery
SafetyNet
SafetyNet
SafetyNet
(4) Forward
selectively disable adaptive rout-
slow-start execution after
slow-start execution after recovery, with
Progress
ing during re-execution
recovery
sufficient buffering during slow-start
Result
simpler protocol with rare mis-
protocol almost never exercises cor- simpler network incurs no deadlocks in
speculations
ner case in practice
practice
critical events rather than rare corner-case events. Such a
tain its performance even when ten recoveries per second
system may predict that it will not encounter a rare event
occur, and our speculative systems incur recoveries less
and speculatively execute based on that assumption. If the
frequently than that.
system detects mis-speculation (i.e., the rare event
In Section

6, we discuss related work in design simplifi-
occurred), it recovers to a consistent pre-speculation state
cation, before concluding in Section
7.
and resumes execution. Speculation can help in situations
in which (a) the design complexity to handle an infrequent
2 Framework for Speculation for Simplicity
event is far worse than that for the common case, (b)
detecting the infrequent event is much easier than handling
Our framework specifies the four features necessary for
it, and (c) the event is infrequent enough that recoveries
supporting speculation for simplicity. Without these fea-
negligibly impact performance.
tures, speculation for simplicity is not viable.
The primary contribution of this work is a framework
(1) Infrequency of mis-speculation. Mis-speculation must
for simplifying multiprocessor design with speculation.
occur sufficiently rarely that the performance overhead of
This framework specifies four features necessary for sup-
recoveries is not prohibitive.
porting speculation for simplicity: (1) infrequency of mis-
(2) Detection of mis-speculations. The system must
speculation, (2) detection of all mis-speculations, (3)
detect all mis-speculations, and detection mechanisms
recovery from mis-speculation, and (4) guaranteed forward
must not be so complex as to offset gains from speculation.
progress. We present this framework in more detail in
Section
2.
(3) Recovery. To recover from mis-speculation, we need a
recovery mechanism that (a) incurs low runtime overhead
We develop three concrete examples to illustrate how
during mis-speculation-free execution and (b) can recover
we use this framework to simplify the two most compli-
the system to a pre-speculation state. While such a scheme
cated parts of multiprocessor system design: the cache
may be too expensive to implement strictly for purposes of
coherence protocol and the interconnection network. In
speculation for simplicity, we can leverage the same recov-
Section

3, we show how to simplify coherence protocols,
ery mechanism used to improve system availability. For all
with an example of a directory protocol and a snooping
three of our examples, we use SafetyNet [19], a recently
protocol. In Section

4, we show how to simplify the inter-
developed hardware mechanism for recovering the state of
connection network by removing the virtual channels.
a multiprocessor system, although other schemes exist
While these examples are neither exhaustive nor applicable
[17]. Since our speculation extends outside the processor
to all multiprocessors, they do reveal the potential to sim-
core, the recovery scheme must encompass the entire mem-
plify system designs.
ory system, including the caches, cache coherence state,
In Section

5, we evaluate our speculative designs with
and the memory. Periodically, SafetyNet logically check-
full-system simulation and commercial workloads. Results
points the state of the shared memory system and, on mis-
show that speculation is a viable technique for simplifying
speculation, it allows the system to recover to a previous
system design. In general, a speculative system can main-
checkpoint. Checkpoints can span hundreds of thousands
2

of cycles and tolerate long detection latencies. SafetyNet
theorem proving. Testing is a valuable complement to for-
efficiently checkpoints the multiprocessor state, using only
mal verification techniques, and directed testing or ran-
hardware, by incrementally logging changes to the cache
domized testing [3, 23] can uncover many bugs.
and memory state. The system recovers by undoing the
Unfortunately, the complexity of protocols is often due to
logged changes. SafetyNet uses the standard solutions to
subtle race conditions, especially those that are infrequent
the output commit problem (i.e., waiting to verify data
and thus less likely to be uncovered by testing.
before sending it to I/O devices) and the input commit
problem (i.e., logging data received from I/O devices).
3.1 Simplifying a Directory Protocol
(4) Forward progress. We must ensure that, even in the
We now demonstrate how to simultaneously achieve (a)
worst-case mis-speculation scenario, the system continues
the design simplicity of a directory cache coherence proto-
to make forward progress. Mis-speculation must not fall
col that relies on ordering in the interconnection network,
victim to a pathological situation, whether unintentional or
and (b) the benefits of adaptive routing in the interconnect.
due to malicious software. As with detection, mechanisms
We can simplify the design of a directory cache coherence
for forward progress should not be overly complex. All
protocol by relying upon the interconnect to provide point-
three of our examples use similar forward progress mecha-
to-point ordering, a property that guarantees that if a source
nisms that are guaranteed to alter the timing of the execu-
sends two messages to a destination, then the messages
tion after system recovery such that the race cannot recur.
arrive in the order in which they were sent. Point-to-point
ordering eliminates certain potential races in the protocol,
In Section
3
and
Section

4, we present three applications
as we discuss later, and handling these races adds design
of speculation for simplicity, and we summarize them in
and verification complexity.
Table

1. First, we simplify the design of a directory proto-
col by speculating that the adaptively routed interconnect
Although point-to-point ordering can simplify the
provides point-to-point ordering of messages. Second, we
coherence protocol, most high-speed interconnect designs
simplify the design of a broadcast snooping cache coher-
do not provide it. Interconnection networks can often
ence protocols by treating a rare corner case as a mis-spec-
achieve greater throughput and performance by using adap-
ulation instead of explicitly designing for it. Third, we
tive routing. Adaptively routed interconnects, such as that
simplify the design of interconnection networks by remov-
of the Alpha 21364 [16], allow two messages from switch
ing the virtual channels used for deadlock avoidance. In all
S1 to switch S2 to take different paths. Adaptive routing
examples, detection and forward progress are easy to
can improve performance by distributing traffic more
implement and thus speculation simplifies system design.
evenly across the interconnect and by enabling messages to
be routed around localized congestion. Adaptive routing
3 Simplifying Cache Coherence Protocols
can also enhance availability by routing messages around
faulty switches. Adaptive routing, however, does not pro-
In this section, we describe two ways to use speculation
vide point-to-point order in the interconnection network. In
to simplify cache coherence protocols. Cache coherence
addition to adaptive routing, the use of reliable link-level
protocols define the behaviors of the cache and memory
retry mechanisms can preclude the provision of point-to-
controllers. Each controller is a finite state machine that has
point ordering.
some number of states (per cache block) and handles some
number of events that can happen to a block. Numerous
We illustrate an example of how adaptive routing can
controllers concurrently interact with each other with
violate point-to-point order in Figure

1, in which a source
respect to many different blocks. While protocols are sim-
node sends two messages to a destination node. The source
ple at a high level, they are much more complicated to
sends message M2 after sending message M1, but M2
design at a low level. Textbooks often abstract protocols
arrives first at the destination. The reversal in arrival order
into a handful of stable states (MOESI) and a handful of
could be due, for example, to higher contention along the
messages that nodes exchange [6]. In reality, though, proto-
path taken by M1. With static routing, both messages
cols have numerous transient states, and messages race
would have followed the same path and arrived in order.
with each other in the interconnect and can arrive in many
Specific Example. We explore a system with a MOSI
different orders.
directory cache coherence protocol and a two-dimensional
Cache coherence protocols are notoriously difficult to
(2D) torus interconnection network. There are four classes
design and verify. The state space explosion problem—an
of messages in the protocol—Request, ForwardedRequest,
exponential function of the number of controllers, memory
Response, and FinalAck—and each class of messages trav-
blocks, and block states—limits the viability of various for-
els on a logically separate interconnection network (i.e.,
mal verification methods [4], such as model checking and
virtual network). There are three types of Request mes-
3

(1) Infrequency of mis-speculation. The routing algo-
SOURCE
rithm, while adaptive, is still unlikely to violate point-to-
NW
1
NE
point ordering (we will show that it reorders <1% of mes-
Switch
message M1
Switch
sages). Moreover, even when it does violate ordering, few
re-orderings impact correctness. Except in the example
2
message M1
message M2
described above, re-ordering does not affect correctness,
4
for several reasons. First, in this protocol, point-to-point
3
ordering is only necessary on one virtual network (the
SW
SE
ForwardedRequest virtual network). Second, ordering only
Switch
Switch
message M2
matters for messages concerning the same block of mem-
DESTINATION
ory. Third, even for messages concerning the same block,
Figure 1. Violating point-to-point order with adaptive
ordering only matters between certain message types. For
routing. The NW Switch sends Message M1 at time 1
example, the directory can send multiple Forwarded-
and then message M2 at time 2 to the SE Switch.
RequestReadOnly messages to the owner of a block, but
Message M2, however, arrives at time 3, which is before
the order in which they arrive does not matter for correct-
message M1 arrives at time 4.
ness. In particular, the situation in which a Writeback-Ack
sages that processors send to directories: races a Forwarded-RequestReadWrite is particularly rare,
RequestReadOnly, RequestReadWrite, and Writeback.
since a block just evicted at one node is unlikely to be
There are four types of ForwardedRequest messages that
actively wanted by another node.
directories send to processors: Forwarded-
(2) Detection. For this speculative system, a mis-specula-
RequestReadOnly, Forwarded-RequestReadWrite, Invali-
tion can only manifest itself as one particular invalid transi-
dation, and Writeback-Ack. There are three types of
tion in a cache coherence controller, so cache controllers
Response messages that processors or directories send to
can detect all illegal message re-orderings. For our race
requesting processors: Data, Ack, or Nack. Processors send
case, a cache without a valid copy that receives a For-
messages to directories on the FinalAck virtual network to
warded-RequestReadWrite determines this situation to be a
coordinate SafetyNet checkpoints, but we do not discuss
mis-speculation and triggers a system recovery. This mis-
them further here.
speculation cannot manifest itself in any other fashion and
We can simplify directory protocol design by relying
thus is easy to detect.
upon point-to-point order (per virtual network, but not
(3) Recovery. We use SafetyNet in all three of our specula-
across virtual networks) to avoid certain race cases. One
tively simplified designs.
common example of these races occurs when the owner of
(4) Forward progress. To ensure forward progress, we
a block, processor P1, sends a Writeback to the directory
alter the timing of the execution after recovery. We simply
and another processor, P2, sends a RequestReadWrite for
allow the interconnect to selectively disable adaptive rout-
the same block to the directory. If the RequestReadWrite
ing, so that the system can always make forward progress.
arrives first, the directory then sends a Forwarded-
The choice of when to re-enable adaptive routing provides
RequestReadWrite and a Writeback-Ack to P1. If those
an adjustable knob for setting the worst-case lower bound
messages, which travel on the same virtual network, arrive
on performance. At the conservative extreme, never re-
in the reverse order of that in which they were sent (i.e., the
enabling it would bound performance degradation, com-
Writeback-Ack arrives before the Forwarded-
pared to static routing, to the cost of one mis-speculation.
RequestReadWrite), then P1 first sees the Writeback-Ack
and downgrades to Invalid. Thus, it cannot handle the
3.2 Simplifying a Snooping Protocol
incoming Forwarded-RequestReadWrite. Designers of
directory protocols can handle this race, but doing so adds
We now present an example of a protocol race in a
additional states and transitions to the protocol and
broadcast snooping system that the designers (the authors!)
increases the complexity of protocol verification.
did not initially consider. The designers overlooked this
case until weeks later when randomized testing happened
Instead of handling this race by adding extra states and
to uncover it (by crashing the simulator). Instead of forcing
transitions, we implemented a speculative system with an
adaptively routed interconnection network and a directory
cache coherence protocol that relies upon point-to-point
1. While adaptive routing can break point-to-point order, it requires extra
ordering. The adaptive routing algorithm allows messages
buffering to avoid deadlock. To isolate the issue of adaptive routing for
to choose among minimal distance paths based on outgoing
purposes of this discussion, we simplistically avoid deadlock with full
buffering. A realistic implementation would likely use a more clever solu-
queue lengths in each direction.1
tion, such as Duato’s scheme for deadlock-free adaptive routing [8].
4

the designers to re-work the protocol and re-verify it, we
full of requests
response
explore the potential to reduce verification effort and speed
the time to market by treating this edge case as a mis-spec-
proc P1
ulation that triggers system recovery.
Specific Example. We developed a version of the snooping
proc P2
protocol system with support for speculation. The system
response
full of requests
treats a certain protocol corner case as a mis-speculation
instead of handling it.
Figure 2. Endpoint deadlock
(1) Infrequency of mis-speculation. Mis-speculation
full of messages
occurs when a cache controller has a block in state Modi-
message M1
fied (or Owned) and then issues a Writeback for the block,
switch S1
transitioning to a transient state. In this transient state, a
RequestForReadWrite arrives from another node, causing
switch S2
the cache controller to transition to a different transient
message M2
state. Then, in this second transient state, the cache control-
full of messages
ler observes another RequestForReadWrite from another
Figure 3. Switch deadlock
node. This sequence of events occurs exceedingly rarely,
especially since it begins with a Writeback from the cache
controller. Moreover, a block evicted by a Writeback is
4 Simplifying the Interconnection Network
unlikely to be requested by two other nodes. Moreover,
both nodes must request exclusive access to the block in
In this section, we discuss how to simplify deadlock
the interval of time between when the cache controller
avoidance in interconnection networks. Interconnects for
issues its Writeback and then observes its own Writeback
multiprocessors are difficult to design, partly because of
on the request network. While this scenario occurs rarely in
the difficulty of achieving high and robust performance
practice, we still must handle it.
while verifying that deadlock is impossible under all situa-
tions. There are two types of deadlock, which we refer to as
(2) Detection. The potential mis-speculation due to
endpoint deadlock and switch deadlock, based on where
encountering this unspecified coherence transition can only
the deadlock can occur. We discuss both of them now,
manifest itself as one specific invalid transition and is thus
including the primary approach for avoiding them, before
easy to detect. In this particular example, a cache controller
delving into a speculative design that is simpler.
that observes another node’s RequestForReadWrite while
in the transient state described above detects the mis-specu-
Endpoint deadlock. Endpoint deadlock can occur when
lation.
cross-coupled requests depend on each other, as shown in
Figure

2. For example, deadlock can occur if (a) processor
(3) Recovery. We use SafetyNet in all three of our specula-
P1 sends a request for block A to P2 followed by a
tively simplified designs.
response for block B, (b) P2 does the opposite (request for
(4) Forward Progress. As with the prior example, we
B followed by response for A), (c) the incoming queues for
ensure forward progress by altering the timing of the exe-
both processors are full of requests, and (d) neither proces-
cution after recovery. As a fail-safe mechanism, the system
sor can ingest its incoming request until it ingests its
temporarily enters a “slow-start” mode, in which the sys-
incoming response, and they process incoming buffers in
tem restricts the number of coherence transactions allowed
order. This type of deadlock depends on the coherence pro-
to be outstanding (e.g., one). The corner case in the proto-
tocol, but it is independent of interconnection network
col can only occur when at least two transactions race in
topology and routing.
the system. The slow-start mode performance provides a
Switch deadlock. Switch deadlock in the interconnection
worst-case lower bound on performance in the presence of
network can arise due to the combination of cross-coupled
mis-speculations. Moreover, before resorting to slow-start,
messages and insufficient buffering for in-flight messages.
the system could simply try to resume execution, perhaps
Consider the simple example illustrated in Figure

3. In this
in a slightly slower mode, in the likely hope that the race
example, switch S1 wants to send message M1 to switch
does not recur.
S2, and S2 wants to send M2 to S1. However, the buffer
Along with the coherence protocol, the other primary
from S1 to S2 and the buffer from S2 to S1 are both full
source of complexity in a multiprocessor is the intercon-
and unable to accept new messages. Moreover, neither
nection network, and we now discuss how to simplify it.
switch will process its incoming queue until it can send its
5

outgoing message. Thus, if switches process incoming
problems. At the other extreme, the Alpha 21364 intercon-
message buffers in FIFO order, the system now deadlocks,
nect uses nineteen virtual channels (six virtual networks
since neither M1 nor M2 can make progress. This type of
times three virtual channels each, plus an extra channel for
deadlock depends on the topology of the interconnect and
special messages) [16], demonstrating that complicated
the routing policy, but it does not depend on the cache
flow control is implementable.
coherence protocol.
As an alternative to virtual channel/network flow con-
To avoid both types of deadlock, interconnects can
trol, interconnect designers have used deflection routing
either use worst-case buffering or some scheme to break
(a.k.a. hot potato routing) to avoid deadlock. Deflection
the cyclic dependences that can lead to deadlock. Using
routing avoids deadlock without using any buffering, but it
worst-case buffering at each endpoint and switch is the
can suffer from potential livelock problems.
simplest solution, but it is generally not viable since the
Specific Example. To demonstrate the viability of easing
worst case can be far worse than the common case.
interconnection network design, we implemented a 2D
To avoid the costs of worst-case buffering, many inter-
torus interconnect with less than worst-case buffering and
connection networks use schemes that break the cyclic
no virtual channel/network support. We use the same sys-
dependences that can lead to deadlock [10]. To avoid end-
tem model as the directory protocol example in
point deadlock, we can use virtual networks for the differ-
Section

3.1. In the case that the system detects deadlock, it
ent classes of messages. A virtual network is just one or
recovers and resumes execution.
more incoming buffers reserved by the switches and end-
(1) Infrequency of mis-speculation. Adaptively routed
points for a particular class of messages. If we have a dif-
networks are generally designed to avoid potential dead-
ferent virtual network for each class of messages (e.g.,
lock conditions. This is because both deadlock avoidance
request and response), then the incoming queue can never
(e.g., using “escape channels” [8]) and deadlock recovery
fill up with just requests, since we have reserved space for
(e.g., as proposed here) negatively impact performance.
responses. To avoid switch deadlock, we can use virtual
Designers can size buffers to reduce the probability of
channel flow control [7]. A virtual channel is just one or
potential deadlocks. Some networks also use source throt-
more buffers per unidirectional physical link2 that is
tling to further reduce this probability [21]. Using recovery
reserved by the switch for messages of a particular priority.
to resolve deadlock changes the problem from a correct-
In our simple example, if M1 was on virtual channel 1 and
ness issue into a performance issue.
M2 was on virtual channel 2, then deadlock would not
occur. The interaction of virtual channels and virtual net-
(2) Detection. Detection of this form of mis-speculation is
works is multiplicative; if we need N virtual networks to
straightforward, since the requestor can detect all dead-
avoid endpoint deadlock and C virtual channels to avoid
locks by a time-out.3 If a message gets stuck in the inter-
switch deadlock, then we need NxC virtual channels total
connect, the coherence transaction to which it belongs will
(i.e., C per virtual network). For a 2D bidirectional torus
not complete, and the requestor of the transaction will tim-
and our directory cache coherence protocol, we require two
eout and trigger a system recovery. We choose time-out
virtual channels per virtual network (for static routing) and
latency to be long enough to mitigate false positives while
four virtual networks (i.e., 4*2=8 virtual channels total). To
short enough to not slow down SafetyNet commitment of
provide deadlock freedom with adaptive routing requires at
checkpoints.4 Since there is little gain in having a timeout
least one additional virtual channel [9].
latency shorter than necessary for SafetyNet, a processor
times out on its request after three checkpoint intervals.
Virtual channel/network flow control minimizes dead-
lock-free buffering requirements and it is well-understood,
(3) Recovery. We use SafetyNet in all three of our specula-
but it adds complexity and requires additional verification
tively simplified designs.
effort. The SGI Origin directory protocol [13] avoids this
(4) Forward Progress. As with the prior two examples, we
complexity by using only two virtual networks instead of
ensure forward progress by altering the timing of the exe-
the three that would have ensured deadlock avoidance.
cution after recovery. As a fail-safe mechanism, we enter a
Instead, the Origin relies on a higher level mechanism to
“slow-start” mode like that in Section

3.2, in which the sys-
negatively acknowledge (nack) its way out of the deadlocks
that occur due to this limitation, even though nacking
increases protocol complexity and could introduce livelock
3. A timeout mechanism would also detect livelock if we were to use
speculation to support deflection routing on a topology for which deflec-
tion routing does not provably avoid livelock.
4. SafetyNet cannot commit an old checkpoint until it is sure that execu-
tion prior to that checkpoint was mis-speculation-free. Thus, it might have
2. When we refer to virtual channel requirements, they are the require-
to wait as long as the timeout latency to either commit a checkpoint or
ments per unidirectional physical link.
trigger a system recovery.
6

Table 2. Target System Parameters
L1 Cache (I and D)
128 KB, 4-way set associative
L2 Cache
4 MB, 4-way set-associative
Memory System
Memory
2 GB, 64 byte blocks
Miss From Memory
180 ns (uncontended, 2-hop)
Interconnection Networks
link bandwidth = 400MB/sec to 3.2 GB/sec
Checkpoint Log Buffer
512 kbytes total, 72 byte entries
SafetyNet
Checkpoint Interval
100,000 cycles (directory), 3,000 requests (snooping)
Register Checkpointing Latency
100 cycles
tem restricts the number of coherence transactions allowed
Processor. We model a processor core that, given a perfect
to be outstanding. As long as we provide enough buffering
memory system, would execute four billion instructions per
to satisfy the reduced number of transactions, slow-start
second and generate blocking requests to the cache hierar-
provably avoids livelock and provides a worst-case lower
chy and beyond. We use this simple processor model to
bound on performance. We can raise this bound by provid-
enable tractable simulation times for full-system simulation
ing more buffering to support a slow-start mode that allows
of commercial workloads. While an out-of-order processor
more concurrent transactions. Moreover, before resorting
model might affect the absolute values of the results,
to slow-start, the system could simply try to resume execu-
mostly due to being able to maintain more outstanding
tion, possibly in a slower mode, in the hope that deadlock
memory requests, it would not qualitatively change them.
will not occur again. If deadlock does recur, the system can
If having additional outstanding requests leads to more
then fall back on slow-start. Slow-start adds some com-
exercising of certain races or corner cases, we could violate
plexity back into the system, but it is less than that of vir-
the first necessary feature of speculation for simplicity (i.e.,
tual channel flow control.
infrequency of mis-speculation). Then we would have to
re-consider this particular speculation but not speculation
5 Evaluation
for simplicity in general.
In this section, we evaluate the applications of specula-
Memory System. We have implemented a memory hierar-
tion for simplicity that we have developed in Sections 3 and
chy simulator that supports our coherence protocols and
4. Our first goal is to determine the mis-speculation fre-
SafetyNet. The simulator captures all state transitions
quency at which our system performance begins to
(including transient states) of our coherence protocols in
degrade. Our second goal is to determine if our speculative
the cache and memory controllers. We model the intercon-
systems incur mis-speculations infrequently enough. We
nection network topologies and the contention within them.
start by describing our system model, methodology [1], and
In Table

2, we present the design parameters of our target
workloads, and then we discuss our results.
memory systems.
SafetyNet. For our system recovery mechanism, we use
5.1 System Model and Simulation Methodology
SafetyNet [19]. We list SafetyNet system parameters in
Our target system is a 16-node shared-memory multi-
Table
2,
and
SafetyNet performance overhead (in error-free
processor. Each node consists of a processor, two levels of
execution) is minimal for these design parameters, as was
cache, some portion of the shared memory and directory,
demonstrated by Sorin et al. [19]. The checkpoint interval
and a network interface. The processor architecture is
differs between the directory and snooping systems, due to
SPARC v9, and the system runs Solaris 8.
the different logical time basis used for creating consistent
checkpoints. Recovery time varies somewhat, depending
We evaluate our target system with the Simics full-sys-
on how much work the system loses between the recovery
tem, multiprocessor, functional simulator, and we extend
point and when it detects the mis-speculation. We stress-
Simics with a memory hierarchy simulator to compute exe-
tested the recovery mechanism by periodically triggering
cution times. Simics is a system-level architectural simula-
recoveries, and we show these results in Section
5.3.
tor developed by Virtutech AB [15]. We use Simics/sun4u,
which simulates Sun Microsystems’s SPARC V9 platform
5.2 Workloads
architecture (e.g., Sun E6000s) in sufficient detail to boot
unmodified Solaris 8. Simics is a functional simulator only,
Commercial applications represent an important work-
and it assumes that each instruction takes one cycle to exe-
load for shared memory multiprocessors. As such, we eval-
cute (although I/O may take longer), but it provides an
uate our speculative design with four commercial
interface to support detailed memory hierarchy simulation.
applications and one scientific application, described
7

Table 3. Workloads: Wisconsin Commercial
1.0
Workload Suite and a Splash-2 scientific workload
OLTP: Our OLTP workload is based on the TPC-C v3.0
no mis-speculations
benchmark using IBM’s DB2 v7.2 EEE database management
1 mis-speculation per second
system. We use a 1 GB 10-warehouse database stored on five
0.5
10 mis-speculations per second
100 mis-speculations per second
raw disks and an additional dedicated database log disk. There
are 8 simulated users per processor. We warm up for 10,000
normalized performance
transactions, and we run for 500 transactions.
0.0
Java Server: SPECjbb2000 is a server-side java benchmark
jbb
apache
slashcode
oltp
barnes
that models a 3-tier system with driver threads. We used Sun’s
HotSpot 1.4.0 Server JVM. Our experiments use 24 threads and
Figure 4. Performance vs. Mis-speculation Rate
24 warehouses (~500 MB of data). We warm up for 100,000
transactions, and we run for 10,000 transactions.
Speculatively Simplified Directory Protocol Results. We
evaluated the performance of the speculatively simplified
Static Web Server: We use Apache 1.3.19 (www.apache.org)
directory protocol to determine if mis-speculations are suf-
for SPARC/Solaris 8, configured to use pthread locks and mini-
ficiently infrequent to make speculation viable. Re-order-
mal logging as the web server. We use SURGE to generate web
requests. We use a repository of 2,000 files (totalling ~50 MB).
ing and mis-speculation rates are a function of available
There are 10 simulated users per processor. We warm up for
bandwidth, since increasing the bandwidth provides fewer
~80,000 requests and run for 5000 requests.
opportunities for adaptive routing. The primary result is
that virtually no reorderings occur in our system, even for
Dynamic Web Server: Slashcode is based on a dynamic web
message posting system used by slashdot.com. We use Slash-
link bandwidths as low as 400 MBytes/sec.
code 2.0, Apache 1.3.20, and Apache’s mod_perl 1.25 module
Adaptive routing incurs few recoveries for two reasons.
for the web server. MySQL 3.23.39 is the database engine. The
First, re-orderings are rare, even for link bandwidths of 400
database is a snapshot of slashcode.com, and it contains
MBytes/sec. With mean link utilizations between 13-35%
~3,000 messages. A multithreaded driver simulates browsing
(for static routing), there is little opportunity for adaptive
and posting behavior for 3 users per processor. We warm up for
routing to re-order messages to avoid congestion. Second,
240 transactions and run for 50 transactions.
when re-orderings do occur, the vast majority of them do
Scientific Application: We use barnes-hut from the SPLASH-
not affect correctness. While adaptive routing re-ordered
2 suite [22], with the 16K body input set. We measure from the
0.1-0.2% of all messages on the ForwardedRequest virtual
start of the parallel phase to avoid measuring thread forking.
network, we observed only a handful of recoveries in all
briefly in Table

3 and in more detail by Alameldeen et al.
simulations. On other virtual networks, adaptive routing re-
[1]. To address the variability in runtimes for commercial
ordered as many as 0.8% of messages, but these re-order-
workloads, we simulate each design point multiple times
ings cannot violate correctness in our protocol. It also re-
with small, pseudo-random perturbations of memory laten-
ordered messages across virtual networks (e.g., a request
cies to cause alternative scheduling paths [1]. Error bars in
arrived after a response despite being sent first), but this re-
results represent one standard deviation in each direction.
ordering does not matter in our protocol.
While it might thus appear that adaptive routing is not
5.3 Results
worthwhile, it can still help performance during periods of
We now present the results of our evaluations of each of
higher congestion. Although mean link utilization is low,
our three speculative designs. We first explore the impact
instantaneous utilization is sometimes much greater. In
of mis-speculation, in general, by stress-testing the sys-
these instances, adaptive routing can route more messages
tem’s ability to recover from a range of mis-speculation
around congestion. In Figure

5, for an interconnect with
rates. While mis-speculation occurs infrequently in our
link bandwidth of 400 MBytes/sec, we compare the relative
speculative systems (shown next), we want to determine
performances of systems with static and adaptive routing,
the mis-speculation rate at which the latency of recoveries
and we normalize the results to the performance of static
can impact performance. To isolate the impact of increas-
routing. We observe that adaptive routing achieves a signif-
ing the frequency of recoveries, we implement a system
icant speedup for our workloads because of better instanta-
without speculation and inject periodic recoveries. The
neous link utilization and the infrequency of recoveries.
results, shown in Figure

4, reveal that recovery is suffi-
ciently short that the performance cost of recovering even
5. Mis-speculation does not improve performance for OLTP. The error
ten times per second is negligible.5
bars show that, despite the greater mean performance value with mis-
speculations, the performance is not statistically significantly better.
8

We conclude from this experiment that we can simplify
1.5
interconnection network design with speculation. We can
remove virtual channel deadlock avoidance and specula-
1.0
tively assume that deadlock will not occur. When it does,
static routing
adaptive routing
the system can recover and still make forward progress.
0.5
Summary of Results. These experimental results show
normalized performance
that speculation is a viable technique for simplifying sys-
0.0
tem design. In general, a speculative system can maintain
jbb
apache
slash
oltp
barnes
its performance even when ten recoveries per second occur,
Figure 5. Relative performances of static and adaptive
and our speculative systems incur recoveries less fre-
routing (400 Mbytes/sec link bandwidth)
quently than that.
In summary, designers can simplify a directory protocol
6 Related Work in Simplifying System Design
by speculatively relying on point-to-point ordering despite
the use of adaptive routing. Moreover, the use of adaptive
There exists some prior research in simplifying system
routing may allow for cheaper links and fewer pins, while
design by allocating resources towards common-case sce-
achieving the same bandwidth as a statically routed yet
narios. We illustrate four notable examples of this
more costly interconnect.
approach.
Speculatively Simplified Snooping Protocol Results. We
The first example is the use of software for solving com-
tested the speculatively simplified snooping coherence pro-
plex processor hardware problems. For example, proces-
tocol on our set of commercial workloads, and all of them
sors have trapped to software for floating point arithmetic,
ran to completion without needing to recover even once
such as the Intel 80386 without the 80387 floating point
from reaching the edge case. Thus, performance of the pro-
coprocessor. No recovery is necessary for these traps to
tocol mirrors, for these workloads, that of the fully
software. Also, numerous architectures give the user the
designed protocol. While this observed lack of recoveries
option of trapping to software for IEEE standard denormal-
obviously does not guarantee that the speculative protocol
ized floating point arithmetic, including SPARC v9 [20]
will never have to recover, it does suggest the infrequency
and Intel IA-64 [12]. Localized processor recovery may be
of recoveries due to encountering this corner case in the
necessary in these cases to support precise interrupts.
cache coherence protocol.
Second, prior research has explored the use of fewer vir-
We conclude from this experiment that a system can tol-
tual networks than strictly necessary in all cases. The SGI
erate a corner case in the cache coherence protocol with
Origin [13] can detect deadlock and then fall back on a
speculation. Even if recoveries due to mis-speculations
higher-level mechanism, specified only vaguely in the pub-
slightly degrade performance, the reduced time for design
lic literature, to undo the deadlock and enable forward
and verification provides a benefit that is likely to more
progress. Similarly, more general progressive deadlock
than offset the runtime cost of recoveries.
recovery schemes have been proposed for endpoint dead-
lock [18]. The MIT Alewife uses software to recover from
Simplified Interconnection Network Results. To deter-
interconnect deadlock due to insufficient resources in rare
mine the performance impact of recoveries, we compare
situations [14].
the performance of this system against a system with the
same protocol running on an interconnection network with
Third, some processors have relied on the inherent
worst-case buffering (given no virtual channels/networks).
recoverability of dynamically scheduled processors to
Results (not plotted) show steady performance for systems
recover from corner cases. For example, the Pentium Pro
with buffer sizes at and above 16 but a sharp dropoff in per-
flushes its pipeline when it manipulates certain control reg-
formance for systems with buffers of size 8. Deadlocks do
isters [11]. The Intel Pentium 4 recovers from deadlocks
not occur in any of our workloads until we reduce buffer
due to corner cases in scheduling the processor core [5].
sizing from 16 to 8. Although our 16-processor system can
After recovery, the core will not deadlock, since the recov-
only have a total of 16 requests outstanding, deadlock with
ery (and flush) changes the microarchitectural state enough
buffers of size 16 is still possible because (a) each proces-
to avoid the same corner case that originally caused the
sor can also have a writeback outstanding, (b) each request
deadlock.
can be forwarded from the directory to multiple destina-
Lastly, DIVA [2] uses a simple checker processor to
tions, and (c) all message types share the same buffers.
dynamically verify a complex and possibly faulty proces-
sor and ensure correctness even in the case of design errors.
9

DIVA is the closest work to speculation for simplicity, but
Systems, 12(12):1219–1235, Dec. 2001.
DIVA is limited to only the processor core.
[10] J.
Duato, S.
Yalamanchili, and L.
Ni. Interconnection
Networks. IEEE Computer Society Press, 1997.
7 Conclusions
[11] Intel Corporation. Pentium Processor Family Developer’s
Manual, Volume 3: Architecture and Programming
In this paper, we have discussed how to use speculation
Manual. Intel Corporation, 1995.
to simplify multiprocessor system design. Speculation
[12] Intel Corporation. Intel IA-64 Architecture Software
allows the designer to allocate design and verification effort
Developer’s Manual, Volume 2: IA-64 System Architecture,
towards the common case scenarios while falling back on
Revision 1.1, July 2000.
system-wide recovery for situations that are rare and com-
[13] J.
Laudon and D.
Lenoski.
The
SGI
Origin:
A
ccNUMA
plicated. The three applications of speculation for simplic-
Highly Scalable Server. In Proceedings of the 24th Annual
ity that we have presented have demonstrated the potential
International Symposium on Computer Architecture, pages
241–251, June 1997.
to simplify system design.
[14] K.
Mackenzie
et
al.
Exploiting
Two-Case
Delivery
for
Fast
Acknowledgments
Protected Messaging. In Proceedings of the Fourth IEEE
Symposium on High-Performance Computer Architecture
,
We would like to thank José Duato, Alvy Lebeck, Amir
Feb. 1998.
Roth, and the Wisconsin Multifacet Group for their helpful
[15] P.
S. Magnusson et al. Simics: A Full System Simulation
discussions of this work.
Platform. IEEE Computer, 35(2):50–58, Feb. 2002.
[16] S.
Mukherjee et al. The Alpha 21364 Network Architecture.
In Proceedings of 9th Hot Interconnects Symposium, Aug.
References
2001.
[1] A.
R.
Alameldeen
et
al.
Simulating
a
$2M
Commercial
[17] M.
Prvulovic,
Z.
Zhang,
and
J.
Torrellas. ReVive: Cost-
Server on a $2K PC. IEEE Computer, 36(2):50–57, Feb.
Effective Architectural Support for Rollback Recovery in
2003.
Shared-Memory Multiprocessors. In Proceedings of the
29th Annual International Symposium on Computer

[2] T.
M. Austin. DIVA: A Reliable Substrate for Deep
Architecture, pages 111–122, May 2002.
Submicron Microarchitecture Design. In Proceedings of the
32nd Annual International Symposium on
[18] Y.
H.
Song
and
T.
M.
Pinkston.
A
Progressive
Approach
to
Microarchitecture, pages 196–207, Nov. 1999.
Handling Message-Dependent Deadlock in Parallel
Computer Systems. IEEE Transactions on Parallel and
[3] R.
M.
Bentley.
Validating
the
Pentium
4
Microprocessor.
In
Distributed Systems, 14(1):1–17, Jan. 2003.
Proceedings of the International Conference on
Dependable Systems and Networks
, pages 493–498, July
[19] D.
J.
Sorin,
M.
M.
Martin,
M.
D.
Hill,
and
D.
A.
Wood.
2001.
SafetyNet: Improving the Availability of Shared Memory
Multiprocessors with Global Checkpoint/Recovery. In
[4] E.
M.
Clarke
and
J.
M. Wing. Formal Methods: State of the
Proceedings of the 29th Annual International Symposium on
Art and Future Directions. ACM Computing Surveys,
Computer Architecture, pages 123–134, May 2002.
28(4):626–643, Dec. 1996.
[20] Sun Microsystems. UltraSPARC User’s Manual. Sun
[5] B.
Colwell. Personal Communication, June 2002.
Microsystems, Inc., July 1997.
[6] D.
E. Culler and J.
Singh.
Parallel Computer Architecture:
[21] M.
Thottethodi,
A.
R.
Lebeck,
and
S.
S.
Mukherjee.
Self-
A Hardware/Software Approach. Morgan Kaufmann
Tuned Congestion Control for Multiprocessor Networks. In
Publishers, Inc., 1999.
Proceedings of the Seventh IEEE Symposium on High-
[7] W.
J. Dally. Virtual Channel Flow Control. IEEE
Performance Computer Architecture, Jan. 2001.
Transactions on Parallel and Distributed Systems,
[22] S.
C. Woo et al. The SPLASH-2 Programs: Characterization
3(2):194–205, Mar. 1992.
and Methodological Considerations. In Proceedings of the
[8] J.
Duato. A New Theory of Deadlock-Free Adaptive
22nd Annual International Symposium on Computer
Routing in Wormhole Networks. IEEE Transactions on
Architecture, pages 24–37, June 1995.
Parallel and Distributed Systems, 4(12):1320–1331, Dec.
[23] D.
Wood, G.
Gibson, and R.
Katz. Verifying a
1993.
Multiprocessor Cache Controller Using Random Test
[9] J.
Duato and T.
M. Pinkston. A General Theory for
Generation. IEEE Design and Test of Computers, pages 13–
Deadlock-Free Adaptive Routing Using a Mixed Set of
25, Aug. 1990.
Resources. IEEE Transactions on Parallel and Distributed
10

Document Outline