Original PDF Flash format a-scalable-content-addressable-network  


A Scalable Content Addressable Network

A Scalable Content-Addressable Network
Sylvia Ratnasamy1 2
Paul Francis2
Mark Handley2
Richard Karp1 2
Scott Shenker2
1
Dept. of Electrical Eng. & Comp. Sci.
2
ACIRI
University of California, Berkeley
AT&T Center for Internet Research at ICSI
Berkeley, CA, USA
Berkeley, CA, USA
ABSTRACT
hardware, bandwidth, or rack space. As such, peer-to-peer file shar-
ing may lead to new content distribution models for applications
Hash tables – which map “keys” onto “values” – are an essential building
such as software distribution, file sharing, and static web content
block in modern software systems. We believe a similar functionality would
delivery.
be equally valuable to large distributed systems. In this paper, we intro-
Unfortunately, most of the current peer-to-peer designs are not
duce the concept of a Content-Addressable Network (CAN) as a distributed
scalable. For example, in Napster a central server stores the in-
infrastructure that provides hash table-like functionality on Internet-like
dex of all the files available within the Napster user community.
scales. The CAN is scalable, fault-tolerant and completely self-organizing,
To retrieve a file, a user queries this central server using the de-
and we demonstrate its scalability, robustness and low-latency properties
sired file’s well known name and obtains the IP address of a user
through simulation.
machine storing the requested file. The file is then down-loaded di-
rectly from this user machine. Thus, although Napster uses a peer-
1.
INTRODUCTION
to-peer communication model for the actual file transfer, the pro-
A hash table is a data structure that efficiently maps “keys” onto
cess of locating a file is still very much centralized. This makes it
“values” and serves as a core building block in the implementa-
both expensive (to scale the central directory) and vulnerable (since
tion of software systems. We conjecture that many large-scale dis-
there is a single point of failure). Gnutella goes a step further and
tributed systems could likewise benefit from hash table functional-
de-centralizes the file location process as well. Users in a Gnutella
ity. We use the term Content-Addressable Network (CAN) to de-
network self-organize into an application-level mesh on which re-
scribe such a distributed, Internet-scale, hash table.
quests for a file are flooded with a certain scope. Flooding on every
Perhaps the best example of current Internet systems that could
request is clearly not scalable [7] and, because the flooding has to
potentially be improved by a CAN are the recently introduced peer-
be curtailed at some point, may fail to find content that is actu-
to-peer file sharing systems such as Napster [14] and Gnutella [6].
ally in the system. We started our investigation with the question:
In these systems, files are stored at the end user machines (peers)
could one make a scalable peer-to-peer file distribution system? We
rather than at a central server and, as opposed to the traditional
soon recognized that central to any peer-to-peer system is the in-
client-server model, files are transferred directly between peers.
dexing scheme used to map file names (whether well known or dis-
These peer-to-peer systems have become quite popular. Napster
covered through some external mechanism) to their location in the
was introduced in mid-1999 and, as of December 2000, the soft-
system. That is, the peer-to-peer file transfer process is inherently
ware has been down-loaded by 50 million users, making it the
scalable, but the hard part is finding the peer from whom to retrieve
fastest growing application on the Web. New file sharing systems
the file. Thus, a scalable peer-to-peer system requires, at the very
such as Scour, FreeNet, Ohaha, Jungle Monkey, and MojoNation
least, a scalable indexing mechanism. We call such indexing sys-
have all been introduced within the last year.
tems Content-Addressable Networks and, in this paper, propose a
While there remains some (quite justified) skepticism about the
particular CAN design.
business potential of these file sharing systems, we believe their
However, the applicability of CANs is not limited to peer-to-
rapid and wide-spread deployment suggests that there are impor-
peer systems. CANs could also be used in large scale storage
tant advantages to peer-to-peer systems. Peer-to-peer designs har-
management systems such as OceanStore [10], Farsite [1], and
ness huge amounts of resources - the content advertised through
Publius [13]. These systems all require efficient insertion and re-
Napster has been observed to exceed 7 TB of storage on a single
trieval of content in a large distributed storage infrastructure; a scal-
day,1 without requiring centralized planning or huge investments in
able indexing mechanism is an essential component of such an in-
frastructure. In fact, as we discuss in Section 5, the OceanStore
1
Private communication with Yin Zhang and Vern Paxson
system already includes a CAN in its core design (although the
OceanStore CAN, based on Plaxton’s algorithm[15], is somewhat
different from what we propose here).
Another potential application for CANs is in the construction of
Permission to make digital or hard copies of all or part of this work for
wide-area name resolution services that (unlike the DNS) decou-
personal or classroom use is granted without fee provided that copies are
ple the naming scheme from the name resolution process thereby
not made or distributed for profit or commercial advantage and that copies
enabling arbitrary, location-independent naming schemes.
bear this notice and the full citation on the first page. To copy otherwise, to
Our interest in CANs is based on the belief that a hash table-
republish, to post on servers or to redistribute to lists, requires prior specific
like abstraction would give Internet system developers a powerful
permission and/or a fee.
SIGCOMM’01, August 27-31, 2001, San Diego, California, USA..
design tool that could enable new applications and communication
Copyright 2001 ACM 1-58113-411-8/01/0008 ...$5.00.
161

models. However, in this paper our focus is not on the use of CANs
Intuitively, routing in a Content Addressable Network works by
but on their design. In [17], we describe, in some detail, one possi-
following the straight line path through the Cartesian space from
ble application, which we call a “grass-roots” content distribution
source to destination coordinates.
system, that leverages our CAN work.
A CAN node maintains a coordinate routing table that holds the
As we have said, CANs resemble a hash table; the basic oper-
IP address and virtual coordinate zone of each of its immediate
ations performed on a CAN are the insertion, lookup and deletion
neighbors in the coordinate space. In a d-dimensional coordinate
of (key,value) pairs. In our design, the CAN is composed of many
space, two nodes are neighbors if their coordinate spans overlap
individual nodes. Each CAN node stores a chunk (called a zone) of
along d
dimensions and abut along one dimension. For example,
;1
the entire hash table. In addition, a node holds information about
in Figure 2, node 5 is a neighbor of node 1 because its coordinate
a small number of “adjacent” zones in the table. Requests (insert,
zone overlaps with 1’s along the Y axis and abuts along the X-axis.
lookup, or delete) for a particular key are routed by intermediate
On the other hand, node 6 is not a neighbor of 1 because their co-
CAN nodes towards the CAN node whose zone contains that key.
ordinate zones abut along both the X and Y axes. This purely local
Our CAN design is completely distributed (it requires no form of
neighbor state is sufficient to route between two arbitrary points in
centralized control, coordination or configuration), scalable (nodes
the space: A CAN message includes the destination coordinates.
maintain only a small amount of control state that is independent
Using its neighbor coordinate set, a node routes a message towards
of the number of nodes in the system), and fault-tolerant (nodes
its destination by simple greedy forwarding to the neighbor with
can route around failures). Unlike systems such as the DNS or IP
coordinates closest to the destination coordinates. Figure 2 shows
routing, our design does not impose any form of rigid hierarchical
a sample routing path.
naming structure to achieve scalability. Finally, our design can be
For a d dimensional space partitioned into n equal zones, the av-
implemented entirely at the application level.
erage routing path length is d=
n1=d hops and individual nodes
(
4)(
)
In what follows, we describe our basic design for a CAN in Sec-
maintain d neighbors3. These scaling results mean that for a d-
2
tion 2, describe and evaluate this design in more detail in Section 3
dimensional space, we can grow the number of nodes (and hence
and discuss our results in Section 4. We discuss related work in
zones) without increasing per node state while the average path
Section 5 and directions for future work in Section 6.
length grows as O n1=d .
(
)
Note that many different paths exist between two points in the
2.
DESIGN
space and so, even if one or more of a node’s neighbors were to
crash, a node can automatically route along the next best available
First we describe our Content Addressable Network in its most
path.
basic form; in Section 3 we present additional design features that
If however, a node loses all its neighbors in a certain direction,
greatly improve performance and robustness.
and the repair mechanisms described in Section 2.3 have not yet
Our design centers around a virtual d-dimensional Cartesian co-
rebuilt the void in the coordinate space, then greedy forwarding
ordinate space on a d-torus.2 This coordinate space is completely
may temporarily fail. In this case, a node may use an expanding
logical and bears no relation to any physical coordinate system. At
ring search (using stateless, controlled flooding over the unicast
any point in time, the entire coordinate space is dynamically par-
CAN overlay mesh) to locate a node that is closer to the destination
titioned among all the nodes in the system such that every node
than itself. The message is then forwarded to this closer node, from
“owns” its individual, distinct zone within the overall space. For
which greedy forwarding is resumed.
example, Figure 1 shows a 2-dimensional
coordinate
0
1]
0
1]
space partitioned between 5 CAN nodes.
2.2
CAN construction
This virtual coordinate space is used to store (key,value) pairs
As described above, the entire CAN space is divided amongst
as follows: to store a pair (K ,V ), key K is deterministically
the nodes currently in the system. To allow the CAN to grow in-
1
1
1
mapped onto a point P in the coordinate space using a uniform
crementally, a new node that joins the system must be allocated its
hash function. The corresponding (key,value) pair is then stored
own portion of the coordinate space. This is done by an existing
at the node that owns the zone within which the point P lies. To
node splitting its allocated zone in half, retaining half and handing
retrieve an entry corresponding to key K , any node can apply the
the other half to the new node.
1
same deterministic hash function to map K onto point P and then
The process takes three steps:
1
retrieve the corresponding value from the point P . If the point P
is not owned by the requesting node or its immediate neighbors,
1. First the new node must find a node already in the CAN.
the request must be routed through the CAN infrastructure until it
2. Next, using the CAN routing mechanisms, it must find a node
reaches the node in whose zone P lies. Efficient routing is therefore
whose zone will be split.
a critical aspect of a CAN.
Nodes in the CAN self-organize into an overlay network that rep-
3. Finally, the neighbors of the split zone must be notified so
resents this virtual coordinate space. A node learns and maintains
that routing can include the new node.
the IP addresses of those nodes that hold coordinate zones adjoin-
ing its own zone. This set of immediate neighbors in the coordinate
Bootstrap
space serves as a coordinate routing table that enables routing be-
A new CAN node first discovers the IP address of any node cur-
tween arbitrary points in this space.
rently in the system. The functioning of a CAN does not depend
We will describe the three most basic pieces of our design: CAN
3
routing, construction of the CAN coordinate overlay, and mainte-
Recently proposed routing algorithms for location services [15,
20] route in O
n hops with each node maintaining O n
nance of the CAN overlay.
(log
)
(log
)
neighbors. Notice that were we to select the number of dimensions
d=
n = , we could achieve the same scaling properties.We
2.1
Routing in a CAN
(log
)
2
2
choose to hold d fixed independent of n, since we envision apply-
ing CANs to very large systems with frequent topology changes.
2
For simplicity, the illustrations in this paper do not show a torus,
In such systems, it is important to keep the number of neighbors
so the reader must remember that the coordinate space wraps.
independent of the system size
162

(0.5-0.75,0.5-1.0)
1.0
6
2
C
6
2
(0-0.5,0.5-1.0)
D
E
(0.75-1.0,0.5-1.0)
A
3
1
5
3
1
7
5
B
(0-0.5,0-0.5)
(0.5-1.0,0.0-0.5)
0.0
0.0
1.0
4
4
node B’s virtual coordinate zone
(x,y)
Figure 1: Example 2-d space with 5 nodes
sample routing
path from node 1
to point (x,y)

1’s coordinate neighbor set = {2,3,4,5}
1’s coordinate neighbor set = {2,3,4,7}
7’s coordinate neighbor set = { }
7’s coordinate neighbor set = {1,2,4,5}
Figure 2: Example 2-d space before node
Figure 3: Example 2-d space after node
7 joins
7 joins
on the details of how this is done, but we use the same bootstrap
number of nodes in the system. Thus, node insertion affects only
mechanism as YOID [4].
O(number of dimensions) existing nodes, which is important for
As in [4] we assume that a CAN has an associated DNS domain
CANs with huge numbers of nodes.
name, and that this resolves to the IP address of one or more CAN
bootstrap nodes. A bootstrap node maintains a partial list of CAN
2.3
Node departure, recovery and CAN main-
nodes it believes are currently in the system. Simple techniques to
tenance
keep this list reasonably current are described in [4].
When nodes leave a CAN, we need to ensure that the zones they
To join a CAN, a new node looks up the CAN domain name in
occupied are taken over by the remaining nodes. The normal pro-
DNS to retrieve a bootstrap node’s IP address. The bootstrap node
cedure for doing this is for a node to explicitly hand over its zone
then supplies the IP addresses of several randomly chosen nodes
and the associated (key,value) database to one of its neighbors. If
currently in the system.
the zone of one of the neighbors can be merged with the departing
node’s zone to produce a valid single zone, then this is done. If
Finding a Zone
not, then the zone is handed to the neighbor whose current zone is
The new node then randomly chooses a point P in the space and
smallest, and that node will then temporarily handle both zones.
sends a JOIN request destined for point P . This message is sent
The CAN also needs to be robust to node or network failures,
into the CAN via any existing CAN node. Each CAN node then
where one or more nodes simply become unreachable. This is han-
uses the CAN routing mechanism to forward the message, until it
dled through an immediate takeover algorithm that ensures one of
reaches the node in whose zone P lies.
the failed node’s neighbors takes over the zone. However in this
This current occupant node then splits its zone in half and assigns
case the (key,value) pairs held by the departing node are lost until
one half to the new node. The split is done by assuming a certain
the state is refreshed by the holders of the data4.
ordering of the dimensions in deciding along which dimension a
Under normal conditions a node sends periodic update messages
zone is to be split, so that zones can be re-merged when nodes leave.
to each of its neighbors giving its zone coordinates and a list of its
For a 2-d space a zone would first be split along the X dimension,
neighbors and their zone coordinates. The prolonged absence of an
then the Y and so on. The (key, value) pairs from the half zone to
update message from a neighbor signals its failure.
be handed over are also transfered to the new node.
Once a node has decided that its neighbor has died it initiates
the takeover mechanism and starts a takeover timer running. Each
Joining the Routing
neighbor of the failed node will do this independently, with the
Having obtained its zone, the new node learns the IP addresses of
timer initialized in proportion to the volume of the node’s own
its coordinate neighbor set from the previous occupant. This set is
zone. When the timer expires, a node sends a TAKEOVER message
a subset of the previous occupant’s neighbors, plus that occupant
conveying its own zone volume to all of the failed node’s neighbors.
itself. Similarly, the previous occupant updates its neighbor set to
On receipt of a TAKEOVER message, a node cancels its own
eliminate those nodes that are no longer neighbors. Finally, both
timer if the zone volume in the message is smaller that its own zone
the new and old nodes’ neighbors must be informed of this realloca-
volume, or it replies with its own TAKEOVER message. In this way,
tion of space. Every node in the system sends an immediate update
a neighboring node is efficiently chosen that is still alive and has a
message, followed by periodic refreshes, with its currently assigned
small zone volume5.
zone to all its neighbors. These soft-state style updates ensure that
Under certain failure scenarios involving the simultaneous fail-
all of their neighbors will quickly learn about the change and will
ure of multiple adjacent nodes, it is possible that a node detects
update their own neighbor sets accordingly. Figures 2 and 3 show
4
To prevent stale entries as well as to refresh lost entries, nodes
an example of a new node (node 7) joining a 2-dimensional CAN.
that insert (key,value) pairs into the CAN periodically refresh these
The addition of a new node affects only a small number of ex-
entries
isting nodes in a very small locality of the coordinate space. The
5
Additional metrics such as load or the quality of connectivity can
number of neighbors a node maintains depends only on the dimen-
also be taken into account, but in the interests of simplicity we
sionality of the coordinate space and is independent of the total
won’t discuss these further here.
163

a failure, but less than half of the failed node’s neighbors are still
scales as O d n1=d in keeping with the analytical results for per-
(
(
))
reachable. If the node takes over another zone under these circum-
fectly partitioned coordinate spaces.
stances, it is possible for the CAN state to become inconsistent. In
Because increasing the number of dimensions implies that a node
such cases, prior to triggering the repair mechanism, the node per-
has more neighbors, the routing fault tolerance also improves as a
forms an expanding ring search for any nodes residing beyond the
node now has more potential next hop nodes along which messages
failure region and hence it eventually rebuilds sufficient neighbor
can be routed in the event that one or more neighboring nodes crash.
state to initiate a takeover safely.
Finally, both the normal leaving procedure and the immediate
3.2
Realities: multiple coordinate spaces
takeover algorithm can result in a node holding more than one
The second observation is that we can maintain multiple, inde-
zone. To prevent repeated further fragmentation of the space, a
pendent coordinate spaces with each node in the system being as-
background zone-reassignment algorithm, which we describe in
signed a different zone in each coordinate space. We call each such
Appendix A, runs to ensure that the CAN tends back towards one
coordinate space a “reality”. Hence, for a CAN with r realities, a
zone per node.
single node is assigned r coordinate zones, one on every reality and
holds r independent neighbor sets.
3.
DESIGN IMPROVEMENTS
The contents of the hash table are replicated on every reality.
Our basic CAN algorithm as described in the previous section
This replication improves data availability. For example, say a
provides a balance between low per-node state (O d for a
pointer to a particular file is to be stored at the coordinate loca-
(
)
d
tion (x,y,z). With four independent realities, this pointer would
-dimensional space) and short path lengths with O dn1=d hops
(
)
be stored at four different nodes corresponding to the coordinates
for d dimensions and n nodes. This bound applies to the number
(x,y,z) on each reality and hence it is unavailable only when all
of hops in the CAN path. These are application level hops, not IP-
four nodes are unavailable. Multiple realities also improve rout-
level hops, and the latency of each hop might be substantial; recall
ing fault tolerance, because in the case of a routing breakdown on
that nodes that are adjacent in the CAN might be many miles and
one reality, messages can continue to be routed using the remaining
many IP hops away from each other. The average total latency of
realities.
a lookup is the average number of CAN hops times the average la-
Further, because the contents of the hash table are replicated on
tency of each CAN hop. We would like to achieve a lookup latency
every reality, routing to location (x,y,z) translates to reaching (x,y,z)
that is comparable within a small factor to the underlying IP path
on any reality. A given node owns one zone per reality each of
latencies between the requester and the CAN node holding the key.
which is at a distinct, and possibly distant, location in the coordi-
In this section, we describe a number of design techniques whose
nate space. Thus, an individual node has the ability to reach distant
primary goal is to reduce the latency of CAN routing. Not unin-
portions of the coordinate space in a single hop, thereby greatly re-
tentionally, many of these techniques offer the additional advan-
ducing the average path length. To forward a message, a node now
tage of improved CAN robustness both in terms of routing and data
checks all its neighbors on each reality and forwards the message to
availability. In a nutshell, our strategy in attempting to reduce path
that neighbor with coordinates closest to the destination. Figure 5
latency is to reduce either the path length or the per-CAN-hop la-
plots the path length for increasing numbers of nodes for different
tency. A final improvement we make to our basic design is to add
numbers of realities. From the graph, we see that realities greatly
simple load balancing mechanisms (described in Sections 3.7 and
reduce path length. Thus, using multiple realities reduces the path
3.8).
length and hence the overall CAN path latency.
First, we describe and evaluate each design feature individually
and then, in Section 4, discuss how together they affect the overall
Multiple dimensions versus multiple realities
performance. These added features yield significant improvements
but come at the cost of increased per-node state (although per-node
Increasing either the number of dimensions or realities results in
state still remains independent of the number of nodes in the sys-
shorter path lengths, but higher per-node neighbor state and main-
tem) and somewhat increased complexity. The extent to which the
tenance traffic. Here we compare the relative improvements caused
following techniques are applied (if at all) involves a trade-off be-
by each of these features.
tween improved routing performance and system robustness on the
Figure 6 plots the path length versus the average number of neigh-
one hand and increased per-node state and system complexity on
bors maintained per node for increasing dimensions and realities.
the other. Until we have greater deployment experience, and know
We see that for the same number of neighbors, increasing the di-
the application requirements better, we are not prepared to decide
mensions of the space yields shorter path lengths than increasing
on these tradeoffs.
the number of realities. One should not, however, conclude from
We simulated our CAN design on Transit-Stub (TS) topologies
these tests that multiple dimensions are more valuable than multi-
using the GT-ITM topology generator [22]. TS topologies model
ple realities because multiple realities offer other benefits such as
networks using a 2-level hierarchy of routing domains with transit
improved data availability and fault-tolerance. Rather, the point to
domains that interconnect lower level stub domains.
take away is that if one were willing to incur an increase in the av-
erage per-node neighbor state for the primary purpose of improving
3.1
Multi-dimensioned coordinate spaces
routing efficiency, then the right way to do so would be to increase
The first observation is that our design does not restrict the di-
the dimensionality d of the coordinate space rather than the number
mensionality of the coordinate space. Increasing the dimensions
of realities r.
of the CAN coordinate space reduces the routing path length, and
hence the path latency, for a small increase in the size of the coor-
3.3
Better CAN routing metrics
dinate routing table.
The routing metric, as described in Section 2.1, is the progress
Figure 4 measures this effect of increasing dimensions on rout-
in terms of Cartesian distance made towards the destination. One
ing path length. We plot the path length for increasing numbers of
can improve this metric to better reflect the underlying IP topology
CAN nodes for coordinate spaces with different dimensions. For a
by having each node measure the network-level round-trip-time
system with n nodes and d dimensions, we see that the path length
RTT to each of its neighbors. For a given destination, a message
164

#realities=1
#dimensions=2
Number of nodes = 131,072
256
256
d=2,r=2
2 dimensions
1 reality
increasing dimensions, #realities=2
3 dimensions
2 realities
25
increasing realities, #dimensions=2
4 dimensions
3 realities
128
128
5 dimensions
4 realities
64
64
20
32
r=3
32
16
15
r=4
Number of hops
16
Number of hops
Number of hops
8
d=3
r=5
r=6
8
10
r=7
4
d=4
4
d=5
2
d=6
d=7
256
1024
4096
16K
64K
256K
1M
256
1024
4096
16K
64K
256K
1M
10
15
20
25
30
Number of nodes
Number of nodes
Number of neighbors
Figure 4: Effect of dimensions on path
Figure 5: Effect of multiple realities on
Figure 6: Path length with increasing
length
path length
neighbor state
is forwarded to the neighbor with the maximum ratio of progress
as described earlier, node B first checks whether it has fewer than
to RTT. This favors lower latency paths, and helps the application
MAXPEERS peer nodes. If so, the new node A merely joins B’s
level CAN routing avoid unnecessarily long hops.
zone without any space splitting. Node A obtains both its peer
Unlike increasing the number of dimensions or realities, RTT-
list and its list of coordinate neighbors from B. Periodic soft-state
weighted routing aims at reducing the latency of individual hops
updates from A serve to inform A’s peers and neighbors about its
along the path and not at reducing the path length. Thus, our metric
entry into the system.
for evaluating the efficacy of RTT-weighted routing is the per-hop
If the zone is full (already has MAXPEERS nodes), then the zone
latency, obtained by dividing the overall path latency by the path
is split into half as before. Node B informs each of the nodes on it’s
length.
peer-list that the space is to be split. Using a deterministic rule (for
To quantify the effect of this routing metric, we used Transit-
example the ordering of IP addresses), the nodes on the peer list
Stub topologies with link latencies of 100ms for intra-transit do-
together with the new node A divide themselves equally between
main links, 10ms for stub-transit links and 1ms for intra-stub do-
the two halves of the now split zone. As before, A obtains its initial
main links. With our simulated topology, the average end-to-end la-
list of peers and neighbors from B.
tency of the underlying IP network path between randomly selected
Periodically, a node sends its coordinate neighbor a request for
source-destination nodes is approximately 115ms. Table 1 com-
its list of peers, then measures the RTT to all the nodes in that
pares the average per-hop latency with and without RTT weighting.
neighboring zone and retains the node with the lowest RTT as its
These latencies were averaged over test runs with n, the number of
neighbor in that zone. Thus a node will, over time, measure the
nodes in the CAN, ranging from 8 to 18.
round-trip-time to all the nodes in each neighboring zone and retain
2
2
As can be seen, while the per-hop latency without RTT-weighted
the closest (i.e. lowest latency) nodes in its coordinate neighbor set.
routing matches the underlying average IP network latency, RTT-
After its initial bootstrap into the system, a node can perform this
weighted routing lowers the per-hop latency by between 24% and
RTT measurement operation at very infrequent intervals so as to
40% depending on the number of dimensions. Higher dimensions
not unnecessarily generate large amounts of control traffic.
give more next-hop forwarding choices and hence even greater im-
The contents of the hash table itself may be either divided or
provements.
replicated across the nodes in a zone. Replication provides higher
availability but increases the size of the data stored at every node by
3.4
Overloading coordinate zones
a factor of MAXPEERS (because the overall space is now partitioned
into fewer, and hence larger, zones) and data consistency must be
So far, our design assumes that a zone is, at any point in time,
maintained across peer nodes. On the other hand, partitioning data
assigned to a single node in the system. We now modify this to
among a set of peer nodes does not require consistency mechanisms
allow multiple nodes to share the same zone. Nodes that share
or increased data storage but does not improve availability either.
the same zone are termed peers. We define a system parameter
Overloading zones offers many advantages:
MAXPEERS, which is the maximum number of allowable peers per
zone (we imagine that this value would typically be rather low, 3 or
reduced path length (number of hops), and hence reduced
4 for example).
path latency, because placing multiple nodes per zone has the
With zone overloading, a node maintains a list of its peers in ad-
same effect as reducing the number of nodes in the system.
dition to its neighbor list. While a node must know all the peers in
reduced per-hop latency because a node now has multiple
its own zone, it need not track all the peers in its neighboring zones.
choices in its selection of neighboring nodes and can select
Rather, a node selects one neighbor from amongst the peers in each
neighbors that are closer in terms of latency. Table 2 lists
of its neighboring zones. Thus, zone overloading does not increase
the average per-hop latency for increasing MAXPEERS for
the amount of neighbor information an individual node must hold,
system sizes ranging from 8 to 18 nodes with the same
but does require it to hold additional state for up to
2
2
MAXPEERS
Transit-Stub simulation topologies as in Section 3.3. We see
peer nodes.
that placing 4 nodes per zone can reduce the per-hop latency
Overloading a zone is achieved as follows: When a new node A
by about 45%.
joins the system, it discovers, as before, an existent node B whose
zone it is meant to occupy. Rather than directly splitting its zone
improved fault tolerance because a zone is vacant only when
165

Number of
Non-RTT-weighted
RTT-weighted
Number of nodes per zone
per-hop latency (ms)
dimensions
routing (ms)
routing (ms)
1
116.4
2
116.8
88.3
2
92.8
3
116.7
76.1
3
72.9
4
115.8
71.2
4
64.4
5
115.4
70.9
Table 1: Per-hop latency using RTT-weighted routing
Table 2: Per-hop latencies using multiple nodes per zone
all the nodes in a zone crash simultaneously (in which case
tions and so on. Previously, a new node joined the CAN at a ran-
the repair process of Section 2.3 is still required).
dom point in the entire coordinate space. Now, a new node joins
the CAN at a random point in that portion of the coordinate space
On the negative side, overloading zones adds somewhat to sys-
associated with its landmark ordering.
tem complexity because nodes must additionally track a set of peers.
The rationale behind this scheme is that topologically close nodes
3.5
Multiple hash functions
are likely to have the same ordering and consequently, will reside
in the same portion of the coordinate space and hence neighbors
For improved data availability, one could use k different hash
in the coordinate space are likely to be topologically close on the
functions to map a single key onto k points in the coordinate space
Internet.
and accordingly replicate a single (key,value) pair at k distinct nodes
The metric we use to evaluate the above binning scheme is the
in the system. A (key,value) pair is then unavailable only when all
k
ratio of the latency on the CAN network to the average latency on
replicas are simultaneously unavailable. In addition, queries for
the IP network. We call this the latency stretch. Figure 8 compares
a particular hash table entry could be sent to all k nodes in paral-
the stretch on CANs constructed with and without the above land-
lel thereby reducing the average query latency. Figure 7 plots this
mark ordering scheme. We use the same Transit-Stub topologies
query latency, i.e. the time to fetch a (key,value) pair, for increasing
as before (Section 3.3) and 4 landmarks placed at random with the
number of nodes for different numbers of hash functions.
only restriction that they must be at least 5 hops away from each
Of course, these advantages come at the cost of increasing the
other. As can be seen, landmark ordering greatly improves the path
size of the (key,value) database and query traffic (in the case of
latency.
parallel queries) by a factor of k.
A consequence of the above binning strategy is that the coordi-
Instead of querying all k nodes, a node might instead choose to
nate space is no longer uniformly populated. Because some order-
retrieve an entry from that node which is closest to it in the coordi-
ings (bins) are more likely to occur than others their corresponding
nate space.
portions of the coordinate space are also more densely occupied
3.6
Topologically-sensitive construction of the
than others leading to a slightly uneven distribution of load amongst
CAN overlay network
the nodes. The use of background load balancing techniques (as
described in Appendix A) where an overloaded node hands off a
The CAN construction mechanism described in Section 2.2 allo-
portion of its space to a more lightly loaded one could be used to
cates nodes to zones at random, and so a node’s neighbors on the
alleviate this problem.
CAN need not be topologically nearby on the underlying IP net-
These results seem encouraging and we are continuing to study
work. This can lead to seemingly strange routing scenarios where,
the effect of topology, link delay distribution, number of landmarks
for example, a CAN node in Berkeley has its neighbor nodes in Eu-
and other factors on the above scheme. Landmark ordering is work
rope and hence its path to a node in nearby Stanford may traverse
in progress. We do not discuss or make use of it further in this
distant nodes in Europe. While the design mechanisms described
paper.
in the previous sections try to improve the selection of paths on an
existing overlay network they do not try to improve the overlay net-
work structure itself. In this section, we present some initial results
3.7
More Uniform Partitioning
on our current work on trying to construct CAN topologies that are
When a new node joins, a JOIN message is sent to the owner of
congruent with the underlying IP topology.
a random point in the space. This existing node knows not only its
Our initial scheme assumes the existence of a well known set
own zone coordinates, but also those of its neighbors. Therefore,
of machines (for example, the DNS root name servers) that act as
instead of directly splitting its own zone, the existing occupant node
landmarks on the Internet. We achieve a form of “distributed bin-
first compares the volume of its zone with those of its immediate
ning” of CAN nodes based on their relative distances from this set
neighbors in the coordinate space. The zone that is split to accom-
of landmarks. Every CAN node measures its round-trip-time to
modate the new node is then the one with the largest volume.
each of these landmarks and orders the landmarks in order of in-
This volume balancing check thus tries to achieve a more uni-
creasing RTT. Thus, based on its delay measurements to the differ-
form partitioning of the space over all the nodes and can be used
ent landmarks, every CAN node has an associated ordering. With
m
with or without the landmark ordering scheme from Section 3.6.
landmarks, m such orderings are possible. Accordingly we
Since (key,value) pairs are spread across the coordinate space using
!
partition the coordinate space into m equal sized portions, each
a uniform hash function, the volume of a node’s zone is indicative
!
corresponding to a single ordering. Our current (somewhat naive)
of the size of the (key,value) database the node will have to store,
scheme to partition the space into m portions works as follows: as-
and hence indicative of the load placed on the node. A uniform par-
!
suming a fixed cyclical ordering of the dimensions (e.g. xyzxyzx...),
titioning of the space is thus desirable to achieve load balancing.
we first divide the space, along the first dimension, into m portions,
Note that this is not sufficient for true load balancing because
each portion is then sub-divided along the second dimension into
m
some (key,value) pairs will be more popular than others thus putting
portions each of which is further divided into m
por-
higher load on the nodes hosting those pairs. This is similar to the
;
1
;
2
166

#dimensions=2, #realities=1
#landmarks=4, #realities=1
25
16.0
1 hash function
2-d, with landmark ordering
100
without uniform-partitioning feature
3 hash functions
2-d, without landmark ordering
with uniform-partitioning feature
5 hash functions
4-d, with landmark ordering
14.0
4-d, without landmark ordering
20
80
12.0
10.0
15
60
8.0
10
40
6.0
Latency Stretch
Percentage of nodes
4.0
5
User-perceived Query Latency (s)
20
2.0
0
0
256
1024
4096
16K
64K
256K
256
1024
4096
V/16
V/8
V/4
V/2
V
2V
4V
8V
Number of nodes
Number of nodes
Volume
Figure 7: Reduction in user-perceived
Figure 8: Latency savings due to land-
Figure 9: Effect of Uniform Partitioning
query latency with the use of multiple
mark ordering used in CAN construction
feature on a CAN with 65,536 nodes, 3
hash functions
dimensions and 1 reality
“hot spot” problem on the Web. In Section 3.8 we discuss caching
popular data key is thus eventually replicated within a re-
and replication techniques that can be used to ease this hot spot
gion surrounding the original storage node. A node holding
problem in CANs.
a replica of a requested data key can, with a certain prob-
If the total volume of the entire coordinate space were V and n
ability, choose to either satisfy the request or forward it on
T
the total number of nodes in the system then a perfect partitioning
its way thereby causing the load to be spread over the entire
of the space among the n nodes would assign a zone of volume
V
region rather than just along the periphery.
/n to each node. We use V to denote V /n. We ran simulations
T
T
with 16 nodes both with and without this uniform partitioning fea-
As with all such schemes, cached and replicated data keys should
2
ture. At the end of each run, we compute the volume of the zone
have an associated time-to-live field and be eventually expired from
assigned to each node. Figure 9 plots different possible volumes
the cache.
in terms of V on the X axis and shows the percentage of the total
number of nodes (Y axis) that were assigned zones of a particular
4.
DESIGN REVIEW
volume. From the plot, we can see that without the uniform parti-
Sections 2 and 3 described and evaluated individual CAN design
tioning feature a little over 40% of the nodes are assigned to zones
components. The evaluation of our CAN recovery algorithms (us-
with volume V as compared to almost 90% with this feature and
ing both large scale and smaller scale ns simulations), are presented
the largest zone volume drops from V to V . Not surprisingly,
8
2
in [18]. Here we briefly recap our design parameters and metrics,
the partitioning of the space further improves with increasing di-
summarize the effect of each parameter on the different metrics and
mensions.
quantify the performance gains achieved by the cumulative effect
3.8
Caching and Replication techniques for
of all the features.
“hot spot” management
We used the following metrics to evaluate system performance:
As with files in the Web, certain (key,value) pairs in a CAN are
Path length: the number of (application-level) hops required
likely to be far more frequently accessed than others, thus overload-
to route between two points in the coordinate space.
ing nodes that hold these popular data keys. To make very popular
Neighbor-state: the number of CAN nodes for which an in-
data keys widely available, we borrow some of the caching and
dividual node must retain state.
replication techniques commonly applied to the Web.
Latency: we consider both the end-to-end latency of the to-
tal routing path between two points in the coordinate space
Caching: In addition to its primary data store (i.e. those data
and the per-hop latency, i.e., latency of individual application
keys that hash into its coordinate zone), a CAN node main-
level hops obtained by dividing the end-to-end latency by the
tains a cache of the data keys it recently accessed. Before
path length.
forwarding a request for a data key towards its destination,
Volume: the volume of the zone to which a node is assigned,
a node first checks whether the requested data key is in its
that is indicative of the request and storage load a node must
own cache and if so, can itself satisfy the request without
handle.
forwarding it any further. Thus, the number of caches from
Routing fault tolerance: the availability of multiple paths
which a data key can be served grows in direct proportion to
between two points in the CAN.
its popularity and the very act of requesting a data key makes
it more widely available.
Hash table availability: adequate replication of a (key,value)
entry to withstand the loss of one or more replicas.
Replication: A node that finds it is being overloaded by re-
The key design parameters affecting system performance are:
quests for a particular data key can replicate the data key
at each of its neighboring nodes. Replication is thus an ac-
dimensionality of the virtual coordinate space: d
tive pushing out of popular data keys as opposed to caching,
number of realities: r
which is a natural consequence of requesting a data key. A
number of peer nodes per zone: p
167

Parameter
“bare bones”
“knobs on full”
CAN
CAN
d
2
10
r
1
1
Metric
“bare bones” CAN
“knobs on full CAN”
p
0
4
path length
198.0
5.0
k
1
1
# neighbors
4.57
27.1
RTT weighted
OFF
ON
# peers
0
2.95
routing metric
IP latency
115.9ms
82.4ms
Uniform
OFF
ON
CAN path latency
23,008ms
135.29ms
partitioning
Landmark
OFF
OFF
Table 5: CAN Performance Results
ordering
Table 4: CAN parameters
number of hash functions (i.e. number of points per reality
3.2
at which a (key,value) pair is stored): k
H(100,10,1)
H(20,5,2)
use of the RTT-weighted routing metric
R(10,50)
3
10xH(20,5,2)
use of the uniform partitioning feature described in Section 3.7
2.8
In some cases, the effect of a design parameter on certain met-
2.6
rics can be directly inferred from the algorithm; in all other cases
we resorted to simulation. Table 3 summarizes the relationship be-
2.4
tween the different parameters and metrics. A table entry marked
“-” indicates that the given parameter has no significant effect on
2.2
Latency Stretch
that metric, while
and
indicate an increase and decrease respec-
"
#
2
tively in that measure caused by an increase in the corresponding
parameter. The figure numbers included in certain table entries re-
1.8
fer to the corresponding simulation results.
1.6
To measure the cumulative effect of all the above features, we se-
lected a system size of n= 18 nodes and compared two algorithms:
1.4
2
16K
32K
65K
131K
Number of nodes
1. a “bare bones” CAN that does not utilize most of our addi-
tional design features
2. a “knobs-on-full” CAN making full use of our added features
Figure 10: Effect of link delay distribution on CAN latency
(without the landmark ordering feature from Section 3.7)
The topology used for this test is a Transit-Stub topology with
to the edges of the topology without scaling the backbone topol-
a delay of 100ms on intra-transit links, 10ms on stub-transit links
ogy itself. This effectively grows the density at the edges of the
and 1ms on intra-stub links (i.e. 100ms on links that connect two
topology. We found, that as n grows, the total path latency grows
transit nodes, 10ms on links that connect a transit node to a stub
even more slowly than n1=d (with d
in this case) because al-
=
10
node and so forth). Tables 4 and 5 list the values of the parameters
though the path length grows slowly as n1=10 (from 4.56 hops with
and metrics for each test. 6
14
nodes to 5.0 with 18 hops) the latency of the additional hops
2
2
We find these results encouraging as they demonstrate that for a
is lower than the average latency since the added hops are along
system with over 260,000 nodes we can route with a latency that is
low-latency links at the edges of the network.
well within a factor of two of the underlying network latency. The
Extrapolating this scaling trend and making the pessimistic as-
number of neighbors that a node must maintain to achieve this is
sumption that the total latency grows with the increase in path
approximately 30 (27.1 + 2.95) which is definitely on the high side
length (i.e., as n1=10) we could potentially scale the size of the
but not necessarily unreasonable. The biggest gain comes from
system by another 10, reaching a system size of close to a billion
2
increasing the number of dimensions, which lowers the path length
nodes, before seeing the path latency increase to within a factor of
from 198 to approximately 5 hops. However, we can see that the
four of the underlying network latency.
latency reduction heuristics play an important role; without latency
To better understand the effect of link delay distributions on the
heuristics, the end-to-end latency would be close to
ms (#
above results, we repeated the “knobs-on-full” test for different de-
5
115
hops
# latency-per-hop).
lay distributions on the Transit-Stub topologies. We used the fol-
We repeated the above “knobs-on-full” simulation and varied the
lowing topologies:
system size n from 14 to 18. In scaling the CAN system, we
2
2
H
scaled the topology by scaling the number of CAN nodes added
: A Transit-Stub topology with a hierarchical
(100
10
1)
link delay assignment of 100ms on intra-transit links, 10ms
6
The reason the IP latency is 82ms (in the “knobs-on-full” test)
on transit-stub links and 1ms on intra-stub links. This is the
instead of 115ms is not because the average latency of the physical
topology used in the above “knobs-on-full” test.
network is lower but because our CAN algorithm (because of the
use of zone overloading and RTT-weighted routing) automatically
H
retrieves an entry from the closest replica. 82ms represents the
: A Transit-Stub topology with a hierarchical link
(20
5
2)
average IP network level latency from the retrieving node to this
delay assignment of 20ms on intra-transit links, 5ms on transit-
closest replica.
stub links and 2ms on intra-stub links.
168

feature
use
metric
use
number
zone:
number
realities:
dimensions:
Design
R
: A Transit-Stub topology with the delay of every
(10
50)
of
of
link set to a random value between 10ms to 50ms.
p
R
P
uniform
of
T
r
of
arameters
xH
T
: This topology is the same as H
hash
10
(20
5
2)
(20
5
2)
-weighted
d
peer
except that the backbone topology is scaled by a factor of 10
functions:
which implies that the density of CAN nodes on the resultant
partitioning
nodes
topology is about 10 times lower.
routing
For each of the above topologies, we measure the latency stretch
k
per
- the ratio of CAN latency to IP latency - for different system sizes.
The results are shown in Figure 10. We see that while the delay
v
reduced
-
-
distribution affects the absolute value of the latency stretch, in all
ariance
O
#
(fig:
O (hops) length path
cases, the latency stretch grows very slowly with system size. In
(1
(fig:
dn
(
=p
4)
no case do we see a latency stretch of more than 3 for system sizes
1
)
5)
=d
up to 130,000 nodes. The fastest growth is in the case of random
)
delay distributions. This is because in this case, as we grow the
v
reduced
-
-
ariance
O
O
O
state
neighbor
p
CAN system size, the new links added at the edges of the network
(
need not be low latency links (unlike with the hierarchical delay
)
r
(
)
d
(
)
distributions). Finally, we see that latency stretch with topology
H
is slightly lower than with topology
xH
.
(20
5
2)
10
(20
5
2)
T
-
latenc
This is due to the higher density of CAN nodes in the case of
#
#
hop
length
#
length)
#
length)
#
total
able
(due
(fig:
H
(due
(due
(due
; higher densities allow the latency heuristics to yield
(20
5
2)
latenc
path
y)
higher gains.
3:
to
7)
and
Ef
to
to
to
reduced
y)
latenc
f
ect

reduced
reduced
reduced
reduced
5.
RELATED WORK
of
y
We categorize related work as related algorithms in the litera-
design
per
ture relevant to data location and related systems that involve a data
-hop
per
path
path
path
location component.
-
parameters
5.1
Related Algorithms
-
#
-
#
-
-
tenc
per
(table:
(table:
The Distance Vector (DV) and Link State (LS) algorithms used
-hop
y
in IP routing require every router to have some level of knowledge
1)
2)
(the exact link structure in the case of LS and the distance in hops
on
la-
for DV) of the topology of entire network. Unlike our CAN routing
perf
algorithm, DV and LS thus require the widespread dissemination
reduced
-
store:
cated:
O
replicated
O
-
size
ormance
k
of local topology information. While well suited to IP networks
(
r
(
wherein topology changes are infrequent, for networks with fre-
)
)
of
-
O
data
quent topology changes, DV and LS would result in the frequent
v
ariance
propagation of routing updates. Because we wanted our CAN de-
metrics
p
(
)
data
store
,
sign to scale to large numbers of potentially flaky nodes we chose
partitioned
not to use routing schemes such as DV and LS.
(fig:
store
Another goal in designing CANs was to have a truly distributed
9)
routing algorithm, both because this does not stress a small set of
repli-
nodes and because it avoids a single point of failure. Hence we
data
avoided more traditional hierarchical routing algorithms [16, 19,
11, 3].
-
-
-
neighbors)
Perhaps closest in spirit to the CAN routing scheme is the Plax-
"
"
"
ance
routing
(due
ton algorithm [15]. In Plaxton’s algorithm, every node is assigned
a unique n bit label. This n bit label is divided into l levels, with
to
f
n=l
ault
each level having w
bits. A node with label, say xyz, where
=
x,y and z are w bit digits, will have a routing table with:
backup
toler
w
entries of the form:
X X
2
-
w
entries of the form:
x X
2
-
-
store:
O
O replicated
O
-
data
k
w
entries of the form:
x y
2
(
p
(
)
r
(
)
)
store
,
-
where we use the notation
to denote every digit in
::: w ,
0
2
;
1
partitioned
and X to denote any digit in
::: w .
a
0
2
;
1
v
data
ailability
Using the above routing state, a packet is forwarded towards a
destination label node by incrementally “resolving” the destination
label from left to right, i.e., each node forwards a packet to a neigh-
store:
data
bor whose label matches (from left to right) the destination label in
one more digit than its own label does.
169

For a system with n nodes, Plaxton’s algorithm routes in O
n
priate server. OceanStore uses the Plaxton algorithm as the basis
(log
)
hops and requires a routing table size that is O
n . CAN rout-
for its data location scheme. The Plaxton algorithm was described
(log
)
ing by comparison routes in O dn1=d hops (where d is dimen-
above.
(
)
sions) with routing table size O dr which is independent of n. As
(
)
5.2.3
Publius
mentioned earlier, setting d
n = allows our CAN algo-
=
(log
)
2
2
rithm to match Plaxton’s scaling properties. Plaxton’s algorithm
Publius [13] is a Web publishing system that is highly resis-
addresses many of the same issues we do. As such it was a natural
tant to censorship and provides publishers with a high degree of
candidate for CANs and, early into our work, we seriously consid-
anonymity. The system consists of publishers who post Publius
ered using it. However, on studying the details of the algorithm,
content to the web, servers that host random-looking content, and
we decided that it was not well-suited to our application. This is
retrievers that browse Publius content on the web. The current Pub-
primarily because the Plaxton algorithm was originally proposed
lius design assumes the existence of a static, system-wide list of
for web caching environments which are typically administratively
available servers. The self-organizing aspects of our CAN design
configured, have fairly stable hosts and maximal scales on the or-
could potentially be incorporated into the Publius design allowing
der of thousands. While the Plaxton algorithm is very well suited to
it to scale to large numbers of servers. We thus view our work as
such environments, the peer-to-peer contexts we address are quite
complementary to the Publius project.
different. We require a self-configuring system which is capable
5.2.4
Peer-to-peer file sharing systems
of dealing with a very large set of hosts (millions), many of them
potentially quite flaky. However, because the targeted application
Section 1 described the basic operation of the two most widely
is web caching, the Plaxton algorithm does not provide a solution
deployed peer-to-peer file sharing systems; Napster and Gnutella.
whereby nodes can independently discover their neighbors in a de-
We now describe a few more systems in this space that use novel
centralized manner. In fact, the algorithm requires global knowl-
indexing schemes. Although many of these systems address ad-
edge of the topology to achieve a consistent mapping between data
ditional, related problems such as security, anonymity, keyword
objects and the Plaxton nodes holding those objects. Addition-
searching etc., we focus here on their solutions to the indexing
ally, every node arrival and departure affects a logarithmic number
problem.
of nodes which, for large systems with high arrival and departure
Freenet [5, 2] is a file sharing application that additionally pro-
rates, appears to be on the high side because nodes could be con-
tects the anonymity of both authors and readers. Freenet nodes hold
stantly reacting to changes in system membership.
3 types of information: keys (which are analogous to web URLs),
Algorithms built around the concept of geographic routing [9,
addresses of other Freenet nodes that are also likely to know about
12] are similar to our CAN routing algorithm in that they build
similar keys, and optionally the data corresponding to those keys.
around the notion of forwarding messages through a coordinate
A node that receives a request for a key for which it does not know
space. The key difference is that the “space” in their work refers
the exact location forwards the request to a Freenet node that it
to true physical space because of which there is no neighbor dis-
does know about, and whose keys are closer to the requested key.
covery problem (i.e. a node’s neighbors are those that lie in its ra-
Results for both successful and failed searches backtrack along the
dio range). These algorithms are very well suited to their targeted
path the request travelled. If a node fails to locate the desired con-
applications of routing and location services in ad-hoc networks.
tent, it returns a failure message back to its upstream node which
Applying such algorithms to our CAN problem would require us to
will then try the alternate downstream node that is its next best
construct and maintain neighbor relationships that would correctly
choice. In this way, a request operates as a steepest-ascent hill-
mimic geographic space which appears non trivial (for example,
climbing search with backtracking. The authors hypothesize that
GPSR performs certain planarity checks which would be hard to
the quality of the routing should improve over time, for two rea-
achieve without a physical radio medium). Additionally, such ge-
sons. First, nodes should come to specialize in locating sets of
ographic routing algorithms are not obviously extensible to multi-
similar keys because a node listed in routing tables under a partic-
dimensional spaces.
ular key will tend to receive mostly requests for similar keys. Also,
because of backtracking, it will become better informed in its rout-
5.2
Related Systems
ing tables about which other nodes carry those keys. Second, nodes
should become similarly specialized in storing clusters of files hav-
5.2.1
Domain Name System
ing similar keys. This is because forwarding a request successfully
The DNS system in some sense provides the same functionality
will result in the node itself gaining a copy of the requested file,
as a hash table; it stores key value pairs of the form (domain name,
and most requests will be for similar keys and hence the node will
IP address). While a CAN could potentially provide a distributed
mostly acquire files with similar keys. The scalability of the above
DNS-like service, the two systems are quite different. In terms of
algorithm is yet to be fully studied.
functionality, CANs are more general than the DNS. The current
Ongoing work at UCB7 looks into developing a peer-to-peer file
design of the DNS closely ties the naming scheme to the manner in
sharing application using a location algorithm similar to the Plax-
which a name is resolved to an IP address, CAN name resolution
ton algorithm (although developed independently from the Plaxton
is truly independent of the naming scheme. In terms of design, the
work). A novel aspect of their work is the randomization of path
two systems are very different.
selection for improved robustness.
A description and evaluation of these and other file sharing ap-
5.2.2
OceanStore
plications can be found at [23]. A key difference between our CAN
algorithm and most of these file sharing systems is that under nor-
The OceanStore project at U.C.Berkeley [10] is building a utility
mal operating conditions, content that exists within the CAN can al-
infrastructure designed to span the globe and provide continuous
ways be located by any other node because there is a clear “home”
access to persistent information. Servers self-organize into a very
(point) in the CAN for that content and every other node knows
large scale storage system. Data in OceanStore can reside at any
what that home is and how to reach it. With systems such as [2, 6]
server within the OceanStore system and hence a data location al-
gorithm is needed to route requests for a data object to an appro-
7
Private communication with Adam Costello
170

however it is quite possible that even with every node in the sys-
[9] B. Karp and H. Kung. Greedy Perimeter Stateless Routing.
tem behaving correctly, content may not be found either because
In Proceedings of ACM Conf. on Mobile Computing and
content is beyond the horizon of a particular node [6] or because
Networking (MOBICOM), Boston, MA, 2000. ACM.
different nodes have different, inconsistent views of the network
[10] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton,
[2]. Whether this is an important distinguishing factor depends of
D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon,
course on the nature of an application’s goals.
W. Weimer, C. Wells, and B. Zhao. Oceanstore: An
Architecture for Global-scale Persistent Storage. In
6.
DISCUSSION
Proceedings of ASPLOS 2000, Cambridge, Massachusetts,
Nov. 2000.
Our work, so far, addresses two key problems in the design of
[11] S. Kumar, C. Alaettinoglu, and D. Estrin. SCOUT: Scalable
Content-Addressable Networks: scalable routing and indexing. Our
Object Tracking through Unattended Techniques. In
simulation results validate the scalability of our overall design - for
Proceedings of the Eight IEEE International Conference on
a CAN with over 260,000 nodes, we can route with a latency that
Network Protocols, Osaka, Japan, Nov. 2000.
is less than twice the IP path latency.
Certain additional problems remain to be addressed in realizing
[12] J. Li, J. Jannotti, D. D. Couto, D. Karger, and R. Morris. A
a comprehensive CAN system. An important open problem is that
Scalable Location Service for Geographic Ad-hoc Routing.
of designing a secure CAN that is resistant to denial of service at-
In Proceedings of ACM Conf. on Mobile Computing and
tacks. This is a particularly hard problem because (unlike the Web)
Networking (MOBICOM), Boston, MA, 2000. ACM.
a malicious node can act, not only as a malicious client, but also
[13] A. D. R. Marc Waldman and L. F. Cranor. Publius: A
as a malicious server or router. A number of ongoing projects both
Robust, Tamper-evident, Censorship-resistant, Web
in research and industry are looking into the problem of building
Publishing System. In Proceedings of the 9th USENIX
large-scale distributed systems that are both secure and resistant to
Security Symposium, pages 59–72, August 2000.
denial-of-service attacks [13, 10, 2].
[14] Napster. http://www.napster.com.
Additional related problems that are topics for future work in-
[15] C. Plaxton, R. Rajaram, and A. W. Richa. Accessing nearby
clude the extension of our CAN algorithms to handle mutable con-
copies of replicated objects in a distributed environment. In
tent, and the design of search techniques [8, 21] such as keyword
Proceedings of the Ninth Annual ACM Symposium on
searching built around our CAN indexing mechanism.
Parallel Algorithms and Architectures (SPAA), June 1997.
Our interest in exploring the scalability of our design, and the
[16] J. B. Postel. Internet Protocol Specification. ARPANET
difficulty of conducting truly large scale experiments (hundreds of
Working Group Requests for Comment, DDN Network
thousands of nodes), led us to initially evaluate our CAN design
Information Center, SRI International, Menlo Park, CA,
through simulation. Now that simulation has given us some under-
Sept. 1981. RFC-791.
standing of the scaling properties of our design, we are, in collabo-
[17] S. Ratnasamy, P. Francis, M. Handley, R. Karp, J. Padhye,
ration with others, embarking on an implementation project to build
and S. Shenker. Grass-roots Content Distribution: RAID
a file sharing application that uses a CAN for distributed indexing.
meets the Web. Jan. 2001. unpublished document available
at http://www.aciri.org/sylvia/.
7.
ACKNOWLEDGMENTS
[18] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and
The authors would like to thank Steve McCanne, Jitendra Pad-
S. Shenker. A Scalable Content-Addressable Network. In
hye, Brad Karp, Vern Paxson, Randy Katz, Petros Maniatis and the
ICSI Technical Report, Jan. 2001.
anonymous reviewers for their useful comments.
[19] Y. Rekhter and T. Li. A Border Gateway Protocol 4 BGP-4.
ARPANET Working Group Requests for Comment, DDN
8.
REFERENCES
Network Information Center, Mar. 1995. RFC-1771.
[20] I. Stoica, R. Morris, D. Karger, F. Kaashoek, H.
[1] W. Bolosky, J. Douceur, D. Ely, and M. Theimer. Feasibility
Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup
of a Serverless Distributed File System Deployed on an
Service for Internet Applications. In Proceedings ACM
existing set of Desktop PCs. In Proceedings of SIGMETRICS
Sigcomm 2001, San Diego, CA, Aug. 2001.
2000, Santa Clara, CA, June 2000.
[21] M. Welsh, N. Borishov, J. Hill, R. von Behren, and A. Woo.
[2] I. Clarke, O. Sandberg, B. Wiley, and T. Hong. Freenet: A
Querying large collections of music for similarity. Technical
Distributed Anonymous Information Storage and Retrieval
report, University of California, Berkeley, CA, Nov. 1999.
System. ICSI Workshop on Design Issues in Anonymity and
[22] E. Zegura, K. Calvert, and S. Bhattacharjee. How to Model
Unobservability, July 2000.
an Internetwork. In Proceedings IEEE Infocom ’96, San
[3] S. Czerwinski, B. Zhao, T. Hodes, A. Joseph, and R. H. Katz.
Francisco, CA, May 1996.
An Architecture for a Secure Service Discovery Service. In
[23] Zeropaid.com. File sharing portal at
Proceedings of Fifth ACM Conf. on Mobile Computing and
http://www.zeropaid.com.
Networking (MOBICOM), Seattle, WA, 1999. ACM.
[4] P. Francis. Yoid: Extending the Internet Multicast
Architecture. Unpublished paper, available at
http://www.aciri.org/yoid/docs/index.html, Apr. 2000.
[5] FreeNet. http://freenet.sourceforge.net.
[6] Gnutella. http://gnutella.wego.com.
[7] J. Guterman. Gnutella to the Rescue ? Not so Fast, Napster
fiends. Link to article at http://gnutella.wego.com, Sept.
2000.
[8] Infrasearch. http://www.infrasearch.com.
171

Number of dimensions
avg(# hops)
max(# hops)
2
1.12
3
2
3
1.09
3
1
8
4
1.07
3
3
8
Table 6: Background zone reassignment
4
6
10
9
9
1
11
5
7
let d be the last dimension along which node I’s zone was
k
2 3
4
5
6 7
10
11
halved (this can be easily detected by merely searching for the
highest ordered dimension with the shortest coordinate span).
Figure 11: Example depth-first search for a replacement node
from its coordinate routing table, node I selects a neighbor
node J that abuts I along dimension d such that J belongs
k
to the zone that forms the other half to I’s zone by the last
APPENDIX
split along dimension d .
k
A.
CAN MAINTENANCE: BACKGROUND
if the volume of J’s zone equals I’s volume, then I and J are
a pair of sibling leaf nodes whose zones can be combined.
ZONE REASSIGNMENT
If J’s zone is smaller than I’s then I forwards a depth-first
The immediate takeover algorithm described in Section 2.3 may
search request to node J, which then repeats the same steps.
result in a single node being assigned multiple zones. Ideally, we
would like to retain a one-to-one assignment of nodes to zones,
This process repeats until a pair of sibling nodes is found.
because this prevents the coordinate space from becoming highly
We used simulation to measure the number of steps a depth-first
fragmented. To achieve this one-to-one node to zone assignment,
search request has to travel before sibling leaf nodes can be found.
we use a simple algorithm that aims at maintaining, even in the face
Table 6 lists the number of hops away from itself that a node
of node failures, a dissection of the coordinate space that could have
would have to search in order to find a node it can hand off an extra
been created solely by nodes joining the system.
zone to. Because of the more or less uniform partitioning of the
At a general step we can think of each existing zone as a leaf of
space (due to our uniform partitioning feature from Section 3.7), a
a binary “partition tree.” The internal vertices in the tree represent
pair of sibling nodes is typically available very close to the request-
zones that no longer exist, but were split at some previous time. The
ing node, i.e., the dissection tree is well balanced.
children of a tree vertex are the two zones into which it was split.
Of course we don’t maintain this partition tree as a data structure,
but it is useful conceptually.
By an abuse of notation, we use the same name for a leaf ver-
tex, for the zone corresponding to that leaf vertex, and for the node
responsible for that zone. The partition tree, like any binary parti-
tion tree, has the property that in the subtree rooted at any internal
vertex there are two leaves that are siblings.
Now suppose a node wants to hand-off a leaf x. If the sibling
of this leaf is also a leaf (call it y) the hand-off is easy: simply
coalesce leaves x and y, making their former parent vertex a leaf,
and assign node y to that leaf. Thus zones x and y merge into a
single zone which is assigned to node y. If x’s sibling y is not a
leaf, perform a depth-first search in the subtree of the partition tree
rooted at y until two sibling leaves are found. Call these leaves z
and w. Combine z and w, making their former parent a leaf. Thus
zones z and w are merged into a single zone, which is assigned to
node z, and node w takes over zone x.
Figure 11 illustrates this reassignment process. Let us say node
9 fails and by the immediate takeover algorithm node 6 takes over
node 9’s place. By the background reassignment process, node 6
discovers sibling nodes 10 and 11. One of these, say 11 takes over
the combined zones 10 and 11, and 10 takes over what was 9’s
zone.
While the partition tree data structure helps us explain the re-
quired transformations, its global nature makes it unsuitable for
actual implementation. Instead we must effect the required trans-
formations using purely local operations. All an individual node
actually has is its coordinate routing table which captures the adja-
cency structure among the current zones (the leaves of the deletion
tree). However, this adjacency structure is sufficient for emulation
of all the operations on the partition tree.
A node I performs the equivalent of the above described depth-
first search on the partition as follows:
172