Multiprocessor Synchronization
Multiprocessor Architecture
Basics
Companion slides for
The Art of Multiprocessor
Programming
by Maurice Herlihy & Nir Shavit
Multiprocessor Architecture
• Abstract models are (mostly) OK to
understand algorithm correctness
• To understand how concurrent
algorithms perform
• You need to understand something
about multiprocessor architectures
Art of Multiprocessor
2
Programming
Pieces
• Processors
• Threads
• Interconnect
• Memory
• Caches
Art of Multiprocessor
3
Programming
Design an Urban Messenger
Service in 1980
• Downtown Manhattan
• Should you use
– Cars (1980 Buicks, 15 MPG)?
– Bicycles (hire recent graduates)?
• Better use bicycles
Art of Multiprocessor
4
Programming
Technology Changes
• Since 1980, car technology has
changed enormously
– Better mileage (hybrid cars, 35 MPG)
– More reliable
• Should you rethink your Manhattan
messenger service?
Art of Multiprocessor
5
Programming
Processors
• Cycle:
– Fetch and execute one instruction
• Cycle times change
– 1980: 10 million cycles/sec
– 2005: 3,000 million cycles/sec
Art of Multiprocessor
6
Programming
Computer Architecture
• Measure time in cycles
– Absolute cycle times change
• Memory access: ~100s of cycles
– Changes slowly
– Mostly gets worse
Art of Multiprocessor
7
Programming
Threads
• Execution of a sequential program
• Software, not hardware
• A processor can run a thread
• Put it aside
– Thread does I/O
– Thread runs out of time
• Run another thread
Art of Multiprocessor
8
Programming
Interconnect
• Bus
– Like a tiny Ethernet
– Broadcast medium
– Connects
• Processors to memory
• Processors to processors
• Network
– Tiny LAN
– Mostly used on large machines
Art of Multiprocessor
9
Programming
Interconnect
• Interconnect is a finite resource
• Processors can be delayed if others
are consuming too much
• Avoid algorithms that use too much
bandwidth
Art of Multiprocessor
10
Programming
Analogy
• You work in an office
• When you leave for lunch, someone
else takes over your office.
• If you don’t take a break, a security
guard shows up and escorts you to
the cafeteria.
• When you return, you may get a
different office
Art of Multiprocessor
11
Programming
Processor and Memory are Far
Apart
memory
interconnect
processor
Art of Multiprocessor
12
Programming
Reading from Memory
address
Art of Multiprocessor
13
Programming
Reading from Memory
zzz…
Art of Multiprocessor
14
Programming
Reading from Memory
value
Art of Multiprocessor
15
Programming
Writing to Memory
address, value
Art of Multiprocessor
16
Programming
Writing to Memory
zzz…
Art of Multiprocessor
17
Programming
Writing to Memory
ack
Art of Multiprocessor
18
Programming
Remote Spinning
• Thread waits for a bit in memory to
change
– Maybe it tried to dequeue from an
empty buffer
• Spins
– Repeatedly rereads flag bit
• Huge waste of interconnect
bandwidth
Art of Multiprocessor
19
Programming
Analogy
• In the days before the Internet …
• Alice is writing a paper on aardvarks
• Sources are in university library
– Request book by campus mail
– Book arrives by return mail
– Send it back when not in use
• She spends a lot of time in the mail
room
Art of Multiprocessor
20
Programming
Analogy II
• Alice buys
– A desk
• In her office
• To keep the books she is using now
– A bookcase
• in the hall
• To keep the books she will need soon
Art of Multiprocessor
21
Programming
Cache: Reading from Memory
address
cache
Art of Multiprocessor
22
Programming
Cache: Reading from Memory
cache
Art of Multiprocessor
23
Programming
Cache: Reading from Memory
cache
Art of Multiprocessor
24
Programming
Cache Hit
?
cache
Art of Multiprocessor
25
Programming
Cache Hit
Yes!
cache
Art of Multiprocessor
26
Programming
Cache Miss
address
?
No…
cache
Art of Multiprocessor
27
Programming
Cache Miss
cache
Art of Multiprocessor
28
Programming
Cache Miss
cache
Art of Multiprocessor
29
Programming
Local Spinning
• With caches, spinning becomes practical
• First time
– Load flag bit into cache
• As long as it doesn’t change
– Hit in cache (no interconnect used)
• When it changes
– One-time cost
– See cache coherence below
Art of Multiprocessor
30
Programming
Granularity
• Caches operate at a larger
granularity than a word
• Cache line: fixed-size block containing
the address
Art of Multiprocessor
31
Programming
Locality
• If you use an address now, you will
probably use it again soon
– Fetch from cache, not memory
• If you use an address now, you will
probably use a nearby address soon
– In the same cache line
Art of Multiprocessor
32
Programming
Hit Ratio
• Proportion of requests that hit in the
cache
• Measure of effectiveness of caching
mechanism
• Depends on locality of application
Art of Multiprocessor
33
Programming
L1 and L2 Caches
L2
L1
Art of Multiprocessor
34
Programming
L1 and L2 Caches
L2
Small & fast
L1
1 or 2 cycles
~16 byte line
Art of Multiprocessor
35
Programming
L1 and L2 Caches
Larger and slower
10s of cycles
~1K line size
L2
L1
Art of Multiprocessor
36
Programming
When a Cache Becomes Full…
• Need to make room for new entry
• By evicting an existing entry
• Need a replacement policy
– Usually some kind of least recently used
heuristic
Art of Multiprocessor
37
Programming
Fully Associative Cache
• Any line can be anywhere in the cache
– Advantage: can replace any line
– Disadvantage: hard to find lines
Art of Multiprocessor
38
Programming
Direct Mapped Cache
• Every address has exactly 1 slot
– Advantage: easy to find a line
– Disadvantage: must replace fixed line
Art of Multiprocessor
39
Programming
K-way Set Associative Cache
• Each slot holds k lines
– Advantage: pretty easy to find a line
– Advantage: some choice in replacing line
Art of Multiprocessor
40
Programming
Contention
• Alice and Bob are both writing
research papers on aardvarks.
• Alice has encyclopedia vol AA-AC
• Bob asks library for it
– Library asks Alice to return it
– Alice returns it & rerequests it
– Library asks Bob to return it…
Art of Multiprocessor
41
Programming
Contention
• Good to avoid memory contention.
• Idle processors
• Consumes interconnect bandwidth
Art of Multiprocessor
42
Programming
Contention
• Alice is still writing a research paper
on aardvarks.
• Carol is writing a tourist guide to the
German city of Aachen
• No conflict?
– Library deals with volumes, not articles
– Both require same encyclopedia volume
Art of Multiprocessor
43
Programming
False Sharing
• Two processors may conflict over
disjoint addresses
• If those addresses lie on the same
cache line
Art of Multiprocessor
44
Programming
False Sharing
• Large cache line size
– increases locality
– But also increases likelihood of false
sharing
• Sometimes need to “scatter” data to
avoid this problem
Art of Multiprocessor
45
Programming
Cache Coherence
• Processor A and B both cache
address x
• A writes to x
– Updates cache
• How does B find out?
• Many cache coherence protocols in
literature
Art of Multiprocessor
46
Programming
MESI
• Modified
– Have modified cached data, must write
back to memory
Art of Multiprocessor
47
Programming
MESI
• Modified
– Have modified cached data, must write
back to memory
• Exclusive
– Not modified, I have only copy
Art of Multiprocessor
48
Programming
MESI
• Modified
– Have modified cached data, must write
back to memory
• Exclusive
– Not modified, I have only copy
• Shared
– Not modified, may be cached elsewhere
Art of Multiprocessor
49
Programming
MESI
• Modified
– Have modified cached data, must write back to
memory
• Exclusive
– Not modified, I have only copy
• Shared
– Not modified, may be cached elsewhere
• Invalid
– Cache contents not meaningful
Art of Multiprocessor
50
Programming
Processor Issues Load Request
load x
cache
cache
cache
Bus
memory data
Art of Multiprocessor
51 (1)
Programming
Memory Responds
cache
E
cache
cache
Bus
Bus
Got
it!
memory data
da
Art of Multiprocessor
52(3)
Programming
Processor Issues Load Request
Load
x
data
cache
cache
E
Bus
memory data
Art of Multiprocessor
53(2)
Programming
Other Processor Responds
Got it
data
ES da
S cache
cache
Bus
Bus
memory data
Art of Multiprocessor
54(2)
Programming
Modify Cached Data
S data S data
cache
Bus
memory data
Art of Multiprocessor
55 (1)
Programming
Write-Through Cache
Write x!
S data S data
cache
Bus
memory data
Art of Multiprocessor
56(5)
Programming
Write-Through Caches
• Immediately broadcast changes
• Good
– Memory, caches always agree
– More read hits, maybe
• Bad
– Bus traffic on all writes
– Most writes to unshared data
– For example, loop indexes …
Art of Multiprocessor
57 (1)
Programming
Write-Through Caches
• Immediately broadcast changes
• Good
“show stoppers”
– Memory, caches always agree
– More read hits, maybe
• Bad
– Bus traffic on all writes
– Most writes to unshared data
– For example, loop indexes …
Art of Multiprocessor
58 (1)
Programming
Write-Back Caches
• Accumulate changes in cache
• Write back when line evicted
– Need the cache for something else
– Another processor wants it
Art of Multiprocessor
59
Programming
InvalidateInvalidate
x
SI data
cache
S
M data
cache
Bus
memory data
Art of Multiprocessor
60(4)
Programming
Multicore Architectures
• The university president
– Alarmed by fall in productivity
• Puts Alice, Bob, and Carol in same
corridor
– Private desks
– Shared bookcase
• Contention costs go way down
Art of Multiprocessor
61
Programming
Old-School Multiprocessor
cache
cache
cache
Bus
Bus
memory
Art of Multiprocessor
62
Programming
Multicore Architecture
cache
cache
cache
cache
Bus
Bus
memory
Art of Multiprocessor
63
Programming
Multicore
• Private L1 caches
• Shared L2 caches
• Communication between same-chip
processors now very fast
• Different-chip processors still not so
fast
Art of Multiprocessor
64
Programming
NUMA Architectures
• Alice and Bob transfer to NUMA
State University
• No centralized library
• Each office basement holds part of
the library
Art of Multiprocessor
65
Programming
Distributed Shared-Memory
Architectures
• Alice’s has volumes that start with A
– Aardvark papers are convenient: run
downstairs
– Zebra papers are inconvenient: run
across campus
Art of Multiprocessor
66
Programming
SMP vs NUMA
memory
SMP
NUMA
• SMP: symmetric multiprocessor
• NUMA: non-uniform memory access
• CC-NUMA: cache-coherent …
Art of Multiprocessor
67(1)
Programming
Spinning Again
• NUMA without cache
– OK if local variable
– Bad if remote
• Cc-NUMA
– Like SMP
Art of Multiprocessor
68
Programming
Relaxed Memory
• Remember the flag principle?
– Alice and Bob’s flag variables false
• Alice writes true to her flag and
reads Bob’s
• Bob writes true to his flag and reads
Alice’s
• One must see the other’s flag true
Art of Multiprocessor
69
Programming
Not Necessarily So
• Sometimes the compiler reorders
memory operations
• Can improve
– cache performance
– interconnect use
• But unexpected concurrent
interactions
Art of Multiprocessor
70
Programming
Write Buffers
address
• Absorbing
• Batching
Art of Multiprocessor
71
Programming
Volatile
• In Java, if a variable is declared
volatile, operations won’t be
reordered
• Expensive, so use it only when needed
Art of Multiprocessor
72
Programming
This work is licensed under a Creative Commons Attribution-
ShareAlike 2.5 License.
• You are free:
– to Share — to copy, distribute and transmit the work
– to Remix — to adapt the work
• Under the following conditions:
– Attribution. You must attribute the work to ―The Art of
Multiprocessor Programming‖ (but not in any way that
suggests that the authors endorse you or your use of the
work).
– Share Alike. If you alter, transform, or build upon this work,
you may distribute the resulting work only under the same,
similar or a compatible license.
• For any reuse or distribution, you must make clear to others the
license terms of this work. The best way to do this is with a link
to
– http://creativecommons.org/licenses/by-sa/3.0/.
• Any of the above conditions can be waived if you get permission
from the copyright holder.
• Nothing in this license impairs or restricts the author's moral
rights.
Art of Multiprocessor
73
Programming