Implementing Declarative Overlays
Implementing Declarative Overlays
Boon Thau Loo∗
Tyson Condie∗
Joseph M. Hellerstein
UC Berkeley
UC Berkeley
Intel Research Berkeley
UC Berkeley
Petros Maniatis
Timothy Roscoe
Ion Stoica
Intel Research Berkeley
Intel Research Berkeley
UC Berkeley
ABSTRACT
1.
INTRODUCTION
Overlay networks are used today in a variety of distributed
Large-scale distributed systems inherently use one or more
systems ranging from file-sharing and storage systems to
application-level overlay networks as part of their operation.
communication infrastructures. However, designing, build-
In some cases, the overlay is prominent: for example, file-
ing and adapting these overlays to the intended application
sharing networks maintain neighbor tables to route queries.
and the target environment is a difficult and time consuming
In other systems the overlay or overlays may not be as ex-
process.
plicit: for example, Microsoft Exchange email servers within
To ease the development and the deployment of such over-
an enterprise maintain an overlay network among themselves
lay networks we have implemented P2, a system that uses a
using a link-state algorithm over TCP for routing mail and
declarative logic language to express overlay networks in a
status messages.
highly compact and reusable form. P2 can express a Narada-
This paper describes P2, a facility (deployable as a service
style mesh network in 16 rules, and the Chord structured
or library) for the declarative construction, maintenance,
overlay in only 47 rules. P2 directly parses and executes such
and sharing of overlay networks. Applications submit to P2
specifications using a dataflow architecture to construct and
a concise logical description of an overlay network, and P2
maintain overlay networks. We describe the P2 approach,
executes this to maintain routing data structures, perform
how our implementation works, and show by experiment its
resource discovery, and optionally provide forwarding for the
promising trade-off point between specification complexity
overlay.
and performance.
P2 is intended to greatly simplify the process of selecting,
implementing, deploying and evolving an overlay network
design. It is novel in (a) using a declarative logic language
Categories and Subject Descriptors
for specifying overlays, and (b) employing a dataflow frame-
C.2.4 [Computer Communication Networks]: Distributed
work at runtime for maintaining overlays instead of the more
Systems—distributed applications; D.4.7 [Operating Sys-
conventional finite state machine approach. P2 automat-
tems]: Organization and Design—Distributed systems; C.2.2
ically compiles the declarative specification of the overlay
[Computer Communication Networks]: Network Pro-
into a dataflow program, and can compile multiple overlay
tocols—protocol architecture, routing protocols
specifications into a single dataflow.
We believe that these innovations together promise two
advantages: ease of specification, and sharing/reuse of code.
General Terms
P2 overlay descriptions can be extremely concise. For exam-
Design, Experimentation, Languages
ple, Chord [34] can be specified in 47 simple logic rules, ver-
sus thousands of lines of code for the MIT Chord reference
implementation and more than 320 statements in MACE-
Keywords
DON [30], which is a much less complete implementation
Declarative overlays, dataflow engines, executable pseudocode
than ours. Also, the high-level, declarative nature of P2
specifications means that they decompose cleanly into log-
∗Boon Thau Loo and Tyson Condie are supported in part by
ically reusable units: for example, a Symphony DHT [23]
the National Science Foundation under Grants No. 0205647,
might share many of the definitions in the Chord specifica-
0209108, and 0225660, and by a gift from Microsoft Corpo-
tion.
ration.
This facilitates not only code reuse among systems, but
also the comparison, extension, and hybridization of overlay
designs within a single system.
Moreover, describing over-
lays declaratively (effectively as queries) enables the natural
integration of distributed information-gathering tasks like
resource discovery and network status monitoring.
Unlike some other proposals for overlay toolkits, P2 does
not aim for performance results as good as optimized C,
C++, or Java implementations. Instead, our first aim is
Appears in the 20th ACM Symposium on Operating Systems Principles,
to demonstrate that declarative overlay descriptions can be
Brighton UK, October 2005.
1
implemented by P2 with acceptable performance, and that
invariant properties must be maintained via asynchronous
there are benefits to the declarative specification that go
messaging. Unfortunately, graph-theoretic descriptions tend
beyond the raw performance of a single overlay design. We
to be expressed at a high level in natural language, and often
believe that this is useful for rapidly prototyping new ideas,
gloss over details of the actual runtime messaging. As a re-
and eventually for deploying production systems as well.
sult, implementing structure-centric overlays often requires
This is not to say that P2 code is slow. P2’s memory
a fair bit of engineering [10, 29], and different implemen-
footprint running a full Chord implementation is relatively
tations of the same overlay can vary significantly in their
small (about 800 kB of working set) and its CPU usage
actual execution.
is comparable to C++ implementations. However, the P2
P2 spans the two approaches above, and expands upon
specifications we discuss in this paper support a new design
them in a way that we believe is particularly attractive for
point on the trade-off between a high degree of specification
overlay specification and runtime. The interface of P2 is
compactness and the fine-grained timer tuning and adaptiv-
closer in spirit to the structure-centric approach, in that
ity optimizations that pepper the code of mature, efficient
it encourages the specification of overlays as logical struc-
but painstaking overlay implementations.
tures with invariants. However, it also automatically com-
Ultimately, our argument for P2 is similar to the argu-
piles this specification to a dataflow program for managing
ment for SQL and relational database management systems
asynchronous messages, which looks closer to the protocol-
some 35 years ago. The initial goals of our implementation
centric approach. We believe P2 improves upon previous
are also akin to those of the early relational database sys-
overlay specification work in either camp, by providing a
tems: to explore the feasibility of the declarative approach
machine-interpretable description language based on rela-
in practice at a coarse grain, without trying to capture all
tions among node states in the network, and by using a
possible optimizations in the first generation of the system.
dataflow runtime model instead of automaton-based proto-
cols.
1.1
Contributions and Overview
Here, we provide a high-level view of the three compo-
This paper makes the following contributions. First, we
nents of our approach: the use of relational tables to rep-
show how a diverse set of overlays can be expressed concisely
resent overlay state, our high-level declarative language to
in a declarative specification language. Second, we show
specify the overlay’s logical properties and behavior, and
how such specifications can usefully be executed as overlay
graphs of dataflow elements to represent runtime informa-
maintenance protocols – sharing communication, state, and
tion processing. The specific implementation details of these
computation – by our implementation, the P2 distributed
components are deferred until Section 3.
dataflow engine.
Finally, we demonstrate experimentally
that such overlays have acceptable performance compared
2.1
Tables and Streams
to hand-coded implementations.
We model an overlay as a distributed data structure, rep-
The rest of this paper is structured as follows. In Sec-
resented via a set of structured relations (sets of tuples) as
tion 2 we outline the main features of our approach: using
in a relational database. P2 employs two types of relations:
a declarative logic language to specify an overlay, and com-
soft-state tables, and streams of transient tuples, as in stream
piling it to an executable graph of dataflow elements. We
query engines [4, 7, 26].
contrast this approach to the typical techniques from the
There are many ways to represent network graphs, but
literature. In Section 3 we discuss our implementation of
the relational approach seems attractive for a variety of rea-
P2 and the specific challenges we encountered, and then in
sons. First, structured tables are a simple and natural rep-
Section 4 we examine in detail a relatively complex overlay
resentation for network state; for example, neighbor tables
(Chord [34]) as implemented over P2. Section 5 evaluates
are widely used in networks. Second, and more importantly
the performance of this network, and shows it to be accept-
for our purposes, tables and relationships between them are
able despite the simplicity of the specification. Section 6
easy to represent concisely in a declarative language, as the
situates our work in the context of other language-based ap-
success of SQL has shown. Third, the distributed database
proaches and related research in data processing systems.
abstraction provides a consistently–named view of all the lo-
We conclude in Section 7.
cal tables and messages at different nodes: queries and rules
can specify distributed state in a high-level, concise way.
2.
APPROACH
Finally, the relational abstraction is a natural way to reuse
In this section we provide a broad overview of the P2 ap-
functionality and share routing state among different over-
proach to overlay specification and runtime execution. In
lays. Tables with multiple indices can store tuples relevant
the past, overlay networks have typically been characterized
to several overlays or parts of overlays, which can select
in one of two ways. The protocol-centric approach favored by
elements from each table with their own criteria. For in-
MACEDON [30] traces its roots to event languages [13, 35]
stance, a table holding network links along with their mea-
that specify overlay execution via automata for event and
sured capacity and latency can be shared between a latency-
message handling. This style emphasizes the dynamics of
conscious overlay as well as a capacity-conscious overlay. Ta-
the overlay and its maintenance, but makes it difficult to de-
ble names (with appropriate namespace scoping) provide a
termine the overlay’s coarse structure and invariant proper-
natural way to share definitions between multiple overlay
ties. The alternative is a structure-centric approach, whose
specifications.
roots can be traced to the specification of parallel inter-
Our experience with overlay implementations has shown
connection networks [19]. This style, which has influenced
that relations, together with some suitable mechanisms for
the literature on distributed hash tables (DHTs), specifies
selecting tuples from each table, can fairly naturally repre-
overlays by focusing on a network graph structure (hyper-
sent the persistent routing state of the overlays we consid-
cube, torus, de Bruijn graph, small-world graph, etc.), whose
ered. We give examples later in support of this claim.
2
2.2
The OverLog language
from network partitions.
Having established our data model, we turn our atten-
An OverLog program is largely composed of table decla-
tion to the P2 specification language for overlays. As noted
ration statements and rules; we consider each in turn. As
above, we choose to specify overlays declaratively via a logic
in Datalog, the number and types of fields in relations are
language. Our language, which we term OverLog, is based
inferred from their (consistent) use in the program’s rules.
on the widely-used Datalog [2] query language.
However, unlike Datalog, tables must be defined explicitly
A few preliminary remarks are in order to frame the dis-
in OverLog via “materialization” statements, which specify
cussion that follows. Datalog itself is a general declarative
constraints on the size and soft-state lifetime of tuple storage
query language – essentially a subset of Prolog free from
– any relations not declared as tables are treated as named
operational (imperative) constructs. OverLog is not a pure
streams of tuples. For example, the declarations:
logic language like Datalog; we add constructs to specify
materialize(neighbor, 120, infinity, keys(2)).
physical distribution properties (in particular, where tuples
materialize(member, 120, infinity, keys(2)).
are physically generated, stored, or sent), continuous queries
materialize(sequence, infinity, 1, keys(2)).
over streams as well as tables, and deletion of tuples from
tables.
specify that neighbor and member are tables whose tuples
Note that OverLog is not designed as a Domain-Specific
are retained for 120 seconds and have unbounded size, while
Language for overlay specification; it is simply an adapta-
sequence allows a single entry that does not expire. The
tion of a powerful query language to a distributed context
keys(...) construct specifies the position of the tuple field
of data and messages.
Our motivation for the design of
or fields making up the primary key of each table. Each
OverLog was to investigate which language features are of
tuple within a table has unique primary-key fields.
particular value for specifying the properties of overlay net-
Much like Datalog and Prolog, OverLog rules have the
works, and so lay the groundwork for a future, dedicated
form [<ruleID> <head> :- <body>.] where the <body> is
overlay description language. We reflect on the suitability
a list of relations (“predicates”) over constants and variables,
of OverLog for overlay specification in Section 4.1.
and the <head> defines a set of tuples derived by variable
Despite these caveats, overlay descriptions in OverLog are
assignments satisfying the body’s predicates. The order in
remarkably concise, especially considering that they can be
which the rules are presented is immaterial. The commas
directly translated by P2 into dataflow graphs that maintain
separating the predicates in a <body> are interpreted as
overlay networks. In the rest of this section, we introduce
logical conjuncts (AND), and the order in which predicates
OverLog progressively by example, giving a specification of
appear in a <body> has no semantic significance. Following
a mesh overlay like that used by Narada [8] in 16 Over-
Prolog and Datalog, names for tuples, predicates, function
Log rules. Later in the paper we compare the performance
symbols, and constants in OverLog begin with a lower-case
of our full Chord implementation, specified in 47 rules, to
letter, while variable names begin with an uppercase letter.
published results from a handcoded implementation.
Narada periodically gossips with neighbors to refresh mem-
bership information. We start with a rule that causes a node
2.3
OverLog by Example: A Narada Mesh
to initiate a refresh:
Narada is a popular overlay multicast system, which im-
plements the multicast functionality using two layers: the
R1 refreshEvent(X) :- periodic(X, E, 3).
first layer constructs and maintains a mesh connecting all
In Datalog, this rule with identifier
members in the group, while the second layer constructs de-
R1 would be read as
“table
livery trees on top of the mesh using a DVMRP-like multi-
refreshEvent has a row with value (X), for any X,
if table
cast algorithm [11]. We focus on constructing a Narada-like
periodic has a row with value (X, E, 3), for some
mesh here as an example of the use of OverLog.
E.” Because of the use of streams and continuous queries in
P2, the OverLog interpretation is slightly different.
Briefly, the mesh maintenance algorithm works as follows.
First,
Each node maintains a set of neighbors, and the set of all
periodic is a built-in term; it is not a stored ta-
ble but a stream that periodically produces a tuple with a
members in the group. Every member epidemically propa-
unique identifier
gates keep-alives for itself, associated with a monotonically
E at node X – in this example the period
is 3 seconds. Since
increasing sequence number. At the same time, neighbors
refreshEvent and periodic are data
streams rather than stored tables, it is more appropriate
exchange information about membership liveness and se-
to read this rule as “generate a
quence numbers, ensuring that every member will eventually
refreshEvent tuple with a
value
learn of all the other group members’ liveness. If a member
(X) whenever you see a periodic tuple of value (X,
fails to hear from a direct neighbor for a period, it declares
E, 3).”
Before a Narada node can refresh its neighbors, it must
its neighbor dead, updating its own membership state and
update its own sequence number, stored in the singleton
propagating this information to the rest of the population.
table
In addition, each node A periodically probes a random
sequence.
group member B measuring their round-trip latency. If the
R2 refreshSeq(X, NewSeq) :- refreshEvent(X),
probed node (B) improves the routing utility of node A by
sequence(X, Seq), NewSeq := Seq + 1.
a certain threshold, node A adds node B to its neighbor
R3 sequence(X, NewS) :- refreshSeq(X, NewS).
set. Similarly, if node A concludes that the cost of a link to
neighbor B exceeds some predefined threshold, it removes
Every time the refreshEvent is issued for a node X, rule
B from its neighbor set.
R2 creates a new refresh sequence number NewSeq for X by
In the rest of this section, we show how the mesh main-
incrementing the currently stored sequence number Seq in
tenance portion of Narada can be expressed in OverLog. In
the sequence table. Rule R3 updates the stored sequence
the interest of brevity, we omit node removal and recovery
number. Because sequence is a materialized table instead of
3
a data stream, whenever a new sequence tuple is produced,
To join the mesh, a new node need only know one member
as is done with rule R3, it is implicitly inserted into the
of the mesh, placing that member into its neighbor table.
associated table.
Though not explicitly specified in the
N1 neighbor@Y(Y, X) :- refreshSeq@X(X, S),
materialize direc-
neighbor@X(X, Y).
tives above, the neighbor table contains tuples of the form
This rule ensures that neighbor relationships are mutual.
neighbor(MyAddress, NeighborAddress)
Finally, rules L1-4 check neighbor liveness. Every second,
while the member table contains tuples of the form
rule L1 initiates a neighbor check by which rule L2 declares
dead a neighboring member that has failed to refresh for
member(MyAddress, MemberAddress, MemberSequence,
longer than 20 seconds. Dead neighbors are deleted from the
MemberInsertionTime, MemberLive)
neighbor table by rule L3 and rule L4 sets a dead neighbor’s
MemberLive is a boolean indicating whether the local node
member entry to be “dead” and further propagated to the
believes a member is live or has failed.
rest of the mesh during refreshes.
We now introduce location specifiers, which annotate the
components of a rule to specify the node at which the tuples
L1 neighborProbe@X(X) :- periodic@X(X, E, 1).
L2 deadNeighbor@X(X, Y) :- neighborProbe@X(X),
in question should exist. Consider the following:
neighbor@X(X, Y), member@X(X, Y, _, YT, _),
f_now() - YT > 20.
R4 member@Y(Y, A, ASeqX, TimeY, ALiveX) :-
L3 delete neighbor@X(X, Y) :- deadNeighbor@X(X, Y).
refreshSeq@X(X, S), member@X(X, A, ASeqX, _, AliveX),
L4 member@X(X, Neighbor, DeadSeq, T, false) :-
neighbor@X(X, Y), not member@Y(Y, A, _, _, _),
deadNeighbor@X(X, Neighbor),
TimeY := f_now@Y().
member@X(X, Neighbor, S, _, _),
DeadSeq := S + 1, T:= f_now().
This is read as follows: “if a refreshSeq tuple is seen at node
X with fields (X, S), and a (X, A, ASeqX,
, ALiveX) tu-
Note that rule L3 introduces an additional syntactic con-
ple is in X’s member table, and a (X, Y) tuple in X’s neighbor
struct (delete), used to delete tuples from a stored table.
table, and there is no member tuple in Y’s table for address A,
Narada continuously improves its mesh by measuring net-
then a member tuple (Y, A, ASeqX, TimeY, ALiveX) should
work latency to all members.
appear at node Y, where TimeY is the value of the built-in
P0 pingEvent@X(X, Y, E, max<R>) :-
function
1
f now() at Y .” f now returns a node’s wall-clock
periodic@X(X, E, 2), member@X(X, Y, _, _, _),
time, and as in other languages an underscore denotes a
R := f_rand().
“don’t care” variable. Because the location specifiers in this
rule belong to two different nodes, when this rule is executed
Every 2 seconds, rule P0 picks a member at random with
some data are shipped across the network.
which to measure round-trip latency. Specifically, it asso-
In terms of Narada, rule R4 specifies that whenever a re-
ciates a random number with each known member, and then
fresh happens at node X, for any of X’s members unknown to
chooses the member associated with the maximum random
Y, a copy of X’s member tuple for that member appears in Y’s
number. Note that function<fields> denotes an aggregation
table, with Y’s insertion time updated. Note here that this
function, max in this example.
logical rule makes no mention of where the complex body
P1 ping@Y(Y, X, E, T) :-
of the rule will be executed, or how many network messages
pingEvent@X(X, Y, E, _), T := f_now@X().
will be sent. Alternatively, a programmer could have spec-
P2 pong@X(X, Y, E, T) :- ping@Y(Y, X, E, T).
ified this functionality with explicit message transmissions
P3 latency@X(X, Y, T) :- pong@X(X, Y, E, T1),
(see Appendix A).
T := f_now@X() - T1.
Rule R5 below specifies how Y updates an existing member
When a tuple appears in data stream pingEvent, rule P1
entry when X performs a refresh.
pings the randomly chosen member stored in the event, rule
P2 echoes that ping, and rule P3 computes the round-trip
R5 member@Y(Y, A, ASeqX, TimeY, ALiveX) :-
latency of the exchange.
refreshSeq@X(X, S), member@X(X, A, ASeqX, _, ALiveX),
Nodes use such latency measurements – along with the
neighbor@X(X, Y), member@Y(Y, A, ASeqY, _, _),
ASeqY < ASeqX, TimeY := f_now@Y().
paths computed by a routing protocol operating on top of
the mesh – to compute a utility function. A node may add
If both X and Y know member A, and if the sequence number
to its neighbors a member that is not currently a neighbor
that Y knows for A is older than that in X’s member entry,
if the utility gain of doing so lies over an addition threshold.
then Y updates its own entry for A with X’s sequence number
Similarly, if the cost of maintaining a current neighbor is
and the wall-clock time at Y.
greater than a removal threshold, the node may break its link
Finally, every time X refreshes Y, Y updates its member
with that neighbor. We show how neighbor addition would
entry for X itself, as per rule R6.
work in an OverLog implementation of Narada below. We
assume that each node maintains a routing table over the
R6 member@Y(Y, X, S, TimeY, true) :-
mesh which contains for each member the next hop to that
refreshSeq@X(X, S), neighbor@X(X, Y),
member and the cost of the resulting path; e.g., route@X(X,
TimeY := f_now@Y().
Y, N, C) indicates that node X must route via next-hop
1We are purposely vague about the type of location spec-
node N to get to destination Y with a path latency of C.
ifiers. For simplicity, one can simply think of them as IP
addresses. However, we make no assumption about the un-
U1 ugain@X(X, Z, sum<UGain>) :- latency@X(X, Z, T),
derlying addressing scheme in the network, and there may
not neighbor@X(X, Z), route@Z(Z, Y, _, C),
be a case for using P2 in contexts where addressing is dif-
UNew := T + C, route@X(X, Y, _, UCurr), UNew < UCurr,
ferent.
UGain := (UCurr - UNew) / UCurr.
4
Rule U1 measures the utility gain that could be obtained if
traditional protocol specifications and implementations, and
node Z were to become X’s immediate neighbor, as per the
in the MACEDON overlay toolkit. Relative to automata,
Narada definition [8]. For an individual destination Y, this is
logical specifications and dataflow graphs have a number of
computed by taking the latency of Z’s path to Y and adding
software engineering advantages:
the latency between X and Z to it. If this new path latency
(assuming Z becomes the next hop from X) is lower than the
• Scoping: In principle, automata must handle any pos-
current latency of X’s route to Y, then the relative decrease
sible event (message) in each state. While automata
in latency contributes to the utility gain by adding neighbor
can in principle be nested or encapsulated as a matter
Z. If this utility gain is above a threshold addThresh, then
of design discipline, the potentially arbitrary interplay
rule U2 adds this new neighbor
between states and events leads to relatively few design
constraints, making automata difficult both to specify
U2 neighbor@X(X, Z) :- ugain@X(X, Z, UGain),
correctly, and to understand once specified. By con-
UGain > addThresh.
trast, in a dataflow diagram compiled from an OverLog
Appendix A collects the mesh formation portion of our
program, the inputs to an element are by definition
Narada specification in OverLog. This specification consists
coming from other specific elements whose behavior is
of just 16 rules and 3 table declarations. The OverLog spec-
well specified. This constrains what needs to be han-
ification is perhaps not as easy to read as pseudocode, but
dled in each element implementation, aiding in both
remarkably it can be directly compiled and executed by a
specification and comprehension.
set of P2 nodes to maintain a Narada-style mesh.
2.4
Dataflow
• Typing: Similarly, the events or messages handled in
automata are of any type possible in the system. In
Given that OverLog is a declarative query-like language
dataflow diagrams, all tuples that pass along an edge
over distributed nodes, it is natural to consider compiling it
share the same schema, hence a dataflow element im-
into an executable representation akin to a database “query
plementation need only worry about a stream of simi-
plan.” Parallel and distributed database query systems like
lar, well-formed tuples.
Gamma [12], Volcano [14] and PIER [16] use dataflow graphs
as their basic query executables: these graphs connect vari-
• Encapsulation and Reuse: Because automata in-
ous database “operators” with dataflow edges that represent
terrelate possible events and states, they are difficult
the passing of tuples among operators, possibly across a net-
to reuse in other contexts that may involve different
work. A query engine runtime can execute these dataflow
sets of events, or additional states with different in-
programs over stored and streaming data sources.
teractions. By contrast, subsets of rules in OverLog
Traditionally, network implementation models are built
programs can be easily extracted and reused in other
on automaton abstractions, which do not appear at first
programs. Moreover, even the compiled dataflow di-
sight to have much in common with database query plans.
agrams often have discrete subgraphs that are clearly
However software router toolkits like Scout [25], Click [18]
reusable: a dataflow subgraph typically has a few well-
and XORP [15] in recent years have demonstrated that net-
specified inputs (incoming edges) and outputs (outgo-
work message handling and protocol implementation can be
ing edges), and in many cases has easily interpretable
neatly factored into dataflow diagrams, and that this model
behavior. This admits the possibility of allowing in-
provides clarity and extensibility beyond that offered by
coming programs to opportunistically “jump on” to
automata, without sacrificing performance. We adopt the
existing dataflows, in the spirit of adaptive stream
Click term element for a node in a P2 dataflow graph, but
query engines like TelegraphCQ [7].
as in database query plans, each edge in the graph carries a
stream of well structured tuples, rather than annotated IP
The modularity provided by a declarative language is also
packets. Note that while all tuples flowing on a single edge
useful for bootstrapping the system. We can compare P2 to
share a structure (schema), tuples on one edge may have
another distributed query processor built over a dataflow
very different structure than tuples on another – this is a
framework: PIER [16]. PIER uses a distributed hash table
significant distinction with the uniform IP packets of Click.
(Bamboo [29]) as a basic common substrate, which is then
P2 dataflows tend to mix together network packet process-
employed to instantiate query-specific overlay networks such
ing elements for tasks like queuing, (de)multiplexing, and
as aggregation trees. In contrast, P2 simply uses whatever
congestion control along with relational database operators
underlying network is present, and each node can be con-
like joins and aggregations. The use of joins is endemic to P2
figured with a relatively small set of “base facts” (such as
because of our choice of OverLog: the unification (match-
addresses of a few nearby neighbors).
Knowledge of the
ing) of variables in the body of a rule is implemented in a
rest of the network is then built up in the declarative do-
dataflow by an equality-based relational join or equijoin2, a
main. It is possible to construct a DHT over P2 – indeed,
point we return to in Section 2.5.
our main example in this paper is a version of Chord – but
P2 in no way requires a DHT to be present, nor relies on
2.5
Discussion
the assumptions a DHT typically exploits (such as full-mesh
The combination of a declarative language and a dataflow
connectivity between nodes, and lack of explicit control over
runtime forms a powerful and surprisingly natural environ-
node and data placement).
ment for overlay specification and runtime. The obvious al-
We close with a discussion on the (perhaps surprising)
ternative to our approach is the automaton approach used in
centrality of the database equijoin operator in our system
2To avoid confusion with the notion of a node “joining” an
for implementing overlay networks. First, we observe that
overlay, we will use the term equijoin to refer to relational
overlay network maintenance traffic is fundamentally a mat-
joins in the remainder of the paper.
ter of asynchronous data structure manipulation: matching
5
the query is parsed into an internal representation, the plan-
OverLog
Net
ner constructs a corresponding dataflow graph of elements,
and the graph is executed by the runtime until it is canceled.
Parser
In this section, we describe the system bottom-up, start-
ing with the runtime, then the dataflow framework, and
finally the parser and planner that translate from OverLog
parsed query
Dataflow
to dataflow element graphs and tables.
3.1
Runtime
graph
Planner
description
The P2 runtime consists of a library of useful classes and
Tables
functions upon which the rest of the system is built. We
limit our discussion here to those parts that are essential for
Runtime
understanding the rest of the system.
As basic data types, P2 uses Values, and Tuples. A Value
is a reference-counted object used to pass around any item
Figure 1: Block diagram of a P2 node
of information in the system; Value types include strings, in-
tegers, timestamps, and large unique identifiers. The Value
a stream of incoming structural change messages (node ar-
class, together with the rules for converting between the var-
rivals and departures, neighbor table updates, path changes,
ious value types, constitute the concrete type system of P2.
etc.) with existing state at a node to produce new state, new
A Tuple is a vector of Values, and is the basic unit of
messages, or both. This matching of asynchronous messages
data transfer in P2. Dataflow elements, described below,
is naturally representable as an equijoin of a stream and a
pass tuples between them, and table rows hold tuples.
table, whether or not it is highlighted as such in the execu-
Early on in the development of P2 we implemented PEL,
tion model. This issue is discussed further in the context of
a small but powerful expression language for manipulating
database queries in [28].
Values and Tuples. PEL is a stack-based RPN-like postfix
To illustrate the utility of equijoins, we consider the ex-
language, and provides all the natural operations on the P2
ample of rule
concrete data types. PEL is not intended for human use;
R6 from Section 2.3:
rather PEL programs are generated by the planner from
R6 member@Y(Y, X, S, TimeY, true) :-
OverLog. P2 includes a byte-code compiler for PEL and a
refreshSeq@X(X, S), neighbor@X(X, Y),
simple but fast virtual machine that executes the resulting
TimeY := f_now@Y().
code.
Building PEL early in the development process dramati-
Since neighbor has already been declared as a table, this
cally simplified the construction of the rest of P2. Several
rule specifies that the arrival of a refreshSeq event will
dataflow elements benefit greatly from being parameterized
catch any neighbor entries identified in the last 120 seconds.
by one or more PEL programs, as we describe below.
It translates into a simple equijoin of the refreshSeq stream
P2 is currently based on a single-threaded, event-driven
and the neighbor table.
loop using libasync from the SFS toolkit [24]. Each event
handler runs to completion before the next one is called.
3.
IMPLEMENTATION
Long-running computations must therefore be handed off to
Our P2 implementation runs over Linux and consists of
a separate thread if new events are to be handled in a timely
around 20,000 lines of C++, plus 3rd-party support libraries.
manner.
The design of P2 was inspired by prior work in both
databases and networking. Software dataflow architectures
3.2
Tables
like P2 occupy a constrained but surprisingly rich design
Our table implementation is relatively straightforward,
space that has been explored in a variety of contexts3. We
and we only discuss it briefly here. Tables are queues of
based our design in large part on our side-by-side compar-
tuples that implement the expiry time and size constraints
ison between the PIER peer-to-peer query engine [16] and
discussed in Section 2. Tables are named using unique IDs,
the Click router [18]. Like PIER, P2 can manage structured
and consequently can be shared between different queries
data tuples flowing through a broad range of query process-
and/or dataflow elements. Queries over tables can be spec-
ing operators, which may accumulate significant state and
ified by filters written in PEL, providing an expressivity
perform substantial asynchronous processing. Like Click,
roughly equivalent to a traditional database query over a
P2 stresses high-performance transfers of data units, as well
single table. In-memory indices (implemented using stan-
as dataflow elements with both “push” and “pull” modali-
dard balanced binary trees) can be attached to attributes of
ties. P2 differs at its core from both PIER and Click, but
tables to enable quick equality lookups. Note that the table
subsumes many of the architectural features of both.
implementation – including associated indices – is a node-
At a coarse grain, P2 consists of a Datalog parser, a plan-
local construct. The partitioning of tuples across nodes is
ner, a set of dataflow element implementations, and a run-
controlled by the @X location specifiers in the rules (Sec-
time plan executor (Figure 1). The life of a query is simple:
tion 2.3).
3
The current in-memory implementation serves our require-
There is also a rich hardware dataflow tradition in Com-
ments for implementing the overlays discussed in this pa-
puter Architecture (e.g., [27, 36]), with its own terminology
and points of reference. For brevity we will not consider
per, all of which tend to view their routing tables as “soft
those systems here, and when we refer to dataflow architec-
state.” The event-driven, run-to-completion model provided
tures, we limit our discussion to software dataflow.
by libasync obviates the need for locking or transaction
6
support in our application, and relatively simple indices suf-
over the network, queues rarely block when full (instead,
fice to meet our performance requirements. In the future,
they implement an explicit drop policy as in most other
there is clearly scope for table implementations that use sta-
routers), and consequently Click’s design can process pack-
ble storage for persistent data placement, or that wrap an
ets efficiently using only event-driven scheduling of dataflow,
existing relational database implementation.
together with “active elements,” invoked periodically by the
Click scheduler.
3.3
Dataflow framework
In contrast, not only do P2 dataflow graphs tend to branch
more, but tuples are frequently generated inside the dataflow
We now describe how we implement the P2 dataflow model
graph in response to the arrival of other tuples – most com-
described in Section 2. Since P2’s architecture is influenced
monly during equijoin operations, which are fundamental to
by the Click [18] modular router, we first give an overview of
OverLog’s rule constructs.
the Click model, and then describe how and why P2 departs
Furthermore, the consequences of dropping tuples due to
from the design of Click.
queue overflow in P2 are much more undesirable than the
As in Click, nodes in a P2 dataflow graph can be chosen
dropping of a packet in a router under high load. Many
from a set of C++ objects called elements. In database
queue elements in P2 dataflow graphs therefore “block” when
systems these are often called operators, since they derive
full or empty, and a low-latency mechanism is required for
from logical operators in the relational algebra. Although
restarting a particular dataflow when new tuples arrive or
they perform a similar function, P2 elements are typically
space becomes available.
smaller and more numerous than database operators. Unlike
P2 therefore implements a simple signaling facility to al-
textbook database query plans, P2 graphs need not be trees;
low elements to restart flows they have previously blocked.
indeed we make heavy use of cyclic dataflow for the kind of
An extra argument to each “push” or “pull” invocation be-
recursive queries that occur frequently when querying graph
tween elements specifies a callback (in effect, a continuation)
structures.
to be invoked at some later stage if and only if the dataflow
Elements have some number of input and output ports.
has been stalled as a result of the call.
An arc in the dataflow graph is represented by a binding
For a “pull” transition, if the pull call returns no tuple
between an output port on one element and an input port
then there is no data available. When a tuple does become
on another. Tuples arrive at the element on input ports,
available, the callback previously passed with the pull is in-
and elements emit tuples from their output ports. An input
voked. This call will typically happen as part of a push
port must be connected to an output port.
transition into the source element (e.g., in the case of equi-
Handoff of a tuple between two elements takes one of two
joins) or the passage of time (e.g., in a rate limiter), and the
forms, push or pull, determined when the elements are con-
recipient of the callback will generally schedule a deferred
figured into a graph. In a push handoff, the source element
procedure call to retry the pull as soon as possible.
invokes a virtual method on the destination, passing the tu-
“Push” transitions operate slightly differently, since the
ple, while in a pull handoff the destination calls the source
coupling of control flow and dataflow means that the desti-
requesting the tuple, which is returned as the result of the
nation of a push has to accept the tuple – if it did not, any
call. We return to the choice of connection types at the end
state operations that occurred previously in the dataflow
of this section.
chain would have to be undone. As a result, push calls are
The implementation of dataflow elements in P2 differs
always assumed to succeed, and return a boolean indicat-
from Click in significant ways, as a result of different re-
ing whether it is acceptable to call push again. If not, the
quirements.
callback will be invoked at some later stage as with pull.
First, the common case in a router is that a packet tra-
The use of callbacks in this way removes from the ele-
verses a single path through the dataflow graph. Conse-
ment implementation itself any scheduling decisions, while
quently Click implements copy-on-write for packets that must
imposing a minimum of policy. P2’s transitions are not as
be modified (for example, to implement multicast). This has
efficient as Click’s but are still very fast – most take about
the additional benefit of very lightweight hand-offs of pack-
50 machine instructions on an ia32 processor, or 75 if the
ets between elements – throughput is of primary concern in
callback is invoked.
Click, and inter-element handoff is simply pointer passing
through a virtual function call.
In contrast, the dataflow graphs that the P2 planner gen-
3.4
Dataflow elements
erates from OverLog specifications have many more branch-
This section gives a brief overview of the suite of dataflow
ing points and tuples traverse more than one path. For ex-
elements implemented in P2.
ample, a tuple might be stored in a table but also forwarded
To start with, P2 provides the relational operators found
to another element as an event notification. At the same
in most database systems, as well as query processors like
time, raw forwarding performance in P2 is less of a prior-
PIER [16]: selection, projection, streaming relational join
ity. This led to two related design decisions: first, tuples
operations, “group-by,” and various aggregation functions.
in P2 are completely immutable once they are created, and
Many of these elements are greatly simplified by parame-
second, tuples are reference-counted and passed between P2
terizing them with PEL programs; for example, a “project”
elements by reference. C++ inlining and templates mini-
element implements a superset of a purely logical database
mize this overhead at runtime.
projection operator by running a PEL program on each in-
Second, P2 passes tuples between elements rather than
coming tuple to generate an outgoing tuple.
annotated packets, and elements in P2 are frequently emu-
Since one of our motivations in designing P2 was to in-
lating database operators rather than packet routing func-
vestigate the applicability of the dataflow element model for
tions, which means flows frequently block and unblock. In
distributed computing, we have tried to push as much func-
Click, a flow event is typically initiated by a packet arriving
tionality of the system as possible into dataflow elements.
7
One example of this is in P2’s networking stack. Sys-
evaluates the expression over each tuple, dropping those for
tems like PIER [16] abstract details of transport protocols,
which the result is false. In some cases, we can optimize
message formats, marshaling, etc., away from the dataflow
the dataflow to push a selection upstream of an equijoin, to
framework, and operators only deal with fully unmarshaled
limit the state and work in the equijoin.
tuples. In contrast, P2 explicitly uses the dataflow model to
Aggregate operations like min or count are translated af-
chain together separate elements responsible for socket han-
ter equijoins and selections, since they usually operate on
dling, packet scheduling, congestion control, reliable trans-
one of the fields in the rule head. Aggregate elements gener-
mission, data serialization, and dispatch.
ally hold internal state, and when a new tuple arrives either
While P2’s networking subsystem exists entirely as a set of
compute the aggregate incrementally or run a PEL program
dataflow elements, at the OverLog level it is abstracted be-
to recalculate the new aggregate from scratch.
hind the @ syntax for location specifiers. A fuller exploration
The final part of translating each rule is the addition of a
of P2’s networking subsystem and its high-level specification
“projection” element that constructs a tuple matching the
is ongoing and is beyond the scope of this paper.
head of the rule, by using a PEL program to select and
A variety of elements form a bridge between the dataflow
reorder fields in the incoming tuple.
graph and persistent state in the form of stored tables. P2
In addition to creating the relational operations described
has elements that store incoming tuples in tables, lookup ele-
above, the planner also constructs the other areas of the
ments that can iteratively emit all tuples in a table matching
dataflow graph: the networking stack, including multiplex-
a search filter expressed in PEL, and aggregation elements
ing and demultiplexing tuples, marshaling, congestion con-
that maintain an up-to-date aggregate (such as max, min,
trol, etc. As with Click, it also inserts explicit queue el-
count, etc.) on a table and emit it whenever it changes. Ta-
ements where there is a push/pull mismatch between two
bles are frequently shared between elements, though some
elements that need to be connected.
elements hold private tables. For example, the element re-
Certain rule terms in OverLog, such as periodic and
sponsible for eliminating duplicate results in a dataflow uses
f now introduced in Section 2.3, refer to “built in” element
a table to keep track of what it has seen so far.
classes, which the planner also knows about and directly
Finally, like Click, P2 includes a collection of general-
maps to dataflow elements.
purpose “glue” elements, such as a queue, a multiplexer,
As an aside, we have found it useful to implement a log-
a round-robin scheduler (which, when pulled, pulls tuples
ging facility using the same dataflow framework. The plan-
from its inputs in order), etc.
ner can be configured (through additional OverLog direc-
It is quite simple to add new elements to the collection
tives) to connect a “logging” port on particular elements to
provided by P2, but at present the planner is not yet de-
a dataflow chain that sends such tuples over the network,
signed to be easily extensible. To use a new element class,
providing a flexible way to obtain logging and debugging
one must either “hand-wire” dataflow diagrams as in Click [18]
data at runtime.
and PIER [16], or modify the planner to translate OverLog
The planner has the potential to implement a number of
into dataflows that use the new element.
query optimizations in the database literature such as join
reordering [31], Magic Sets [2,21], and multi-query optimiza-
3.5
OverLog Translation
tion [32]; however currently these optimizations are achieved
by rewriting queries at the input by hand.
The OverLog parser in our implementation is fairly con-
ventional, and implemented using flex and bison. It con-
verts OverLog files into a canonical form and produces lists
4.
A BIGGER EXAMPLE: CHORD
of events, rules and table definitions. The heavy lifting of
We now present a more involved P2 example: a full im-
generating the dataflow graphs is performed by the planner,
plementation of the Chord distributed lookup service [34].
which generates a directed graph of dataflow elements from
We show that the OverLog specification of a complex over-
the query parse tree in a variety of phases. We describe the
lay such as Chord can be intuitive, concise, and reusable for
general process of translating an OverLog description here,
other overlays.
and later in Section 4 we explore a concrete example, that
Chord is essentially a mechanism for maintaining a ring.
of Chord.
Nodes join by contacting a landmark node within the ring
First, all the required tables and indices are created. We
that they already know. In OverLog, newcomer ni has a
create an index for every table’s primary key, and secondary
“materialized” fact in a table database pointing to its land-
indices for table keys participating in equijoins. Each rule
mark node li (or to the null node otherwise):
head is associated with either a table or a data stream con-
landmark@ni(ni,li).
sisting of tuples that pass through the dataflow engine with-
out being stored.
as well as a fact about its own address ni and identifier n:
Next, for each rule the planner identifies matching vari-
node@ni(ni,n).
ables across rule terms and creates a sequence of elements
implementing relational equijoins. As noted in Section 2,
To enter the ring, ni generates a join tuple at node ni,
our current version of OverLog only supports equijoins of a
whose arrival triggers the following rules:
stream and a table. Since tables are implemented as main-
C1 joinEvent@NI(NI,E) :- join@NI(NI,E).
memory data structures with local indices over them, tuples
C2 joinReq@LI(LI,N,NI,E) :- joinEvent@NI(NI,E),
from the stream are pushed into an equijoin element, and
node@NI(NI,N), landmark@NI(NI,LI), LI != "-".
all matches in the table are found via an index lookup.
C3 succ@NI(NI,N,NI) :- landmark@NI(NI,LI),
joinEvent@NI(NI,_), node@NI(NI,N), LI == "-".
After the translation of the equijoins in a rule, the plan-
C4 lookup@LI(LI,N,NI,E) :- joinReq@LI(LI,N,NI,E).
ner creates elements for any selection filters. Each filter is
C5 succ@NI(NI,S,SI) :- join@NI(NI,E),
compiled into a PEL expression, and a selection element
lookupResults@NI(NI,_,S,SI,E).
8
Ne
Join
Join
t
w
lookup.NI ==
TimedPullPush
lookup.NI ==
Select
Project
ork In
0
K in (N, S]
lookupRes
node.NI
bestSucc.NI
Join
Agg min<D>
TimedPullPush
bestLookupDist.NI
on finger
0
== node.NI
Mu
D:=K-B-1, B in (N,K)
x
p
Du
Join
Agg min<BI>
TimedPullPush
Ro
bestLookupDist.NI
on finger
Qu
0
u
== node.NI
n
p
D==K-B-1, B in (N,K)
dR
e
u
e
o
b
l
ooku ist
i
n
u
p
D
T
Insert
i
m
L
o
ok
T
i
e
m
dP
b
e
st
0
Insert
e
dP
ull
de
0 ull
P
us
no
Insert
P
us
h
h
s
t
S
u
c
c
na
Dem
(t
be
up
m
e
l
u
node
e
e
r
bestSucc
finger
)
x
fi
ng
Demux
remote
Queue
Network Out
(@local?)
local
Figure 2: Part of the dataflow that P2 generates from the Chord OverLog specification. Not shown are
the elements responsible for stabilization and maintenance of fingers. Following Click conventions, ports are
either triangular (for input) or rectangular (for output), black (for push) or white (for pull), and doubly lined
for agnostic. The “Network In” and “Network Out” elements represent a longer sequence of network-related
elements that we elide here for simplicity.
Rule C1 starts a join event upon arrival of the join tuple. In
tifier is found, F3 places that successor into all appropriate
rule C2, if the landmark node is known (i.e., non-null), then
finger entries: those whose target identifier n + 2I also lie
a joinReq (join request) tuple is sent to that landmark node;
between the local node’s identifier n and that successor on
otherwise C3 sets the node to point to itself as a successor,
the identifier ring.
forming an overlay by itself and awaiting others to join in.
For clarity of exposition, we have kept these finger-fixing
When the landmark receives a joinReq, rule C4 initiates a
rules rather na¨ıve and ignore possible optimizations that
lookup for the successor of the joining node’s identifier N.
speed up the process. In Appendix B we present the ac-
C5 defines the joining node’s successor to be the result of
tual optimized finger-fixing rules F1-9 that we use for our
the lookup. Successors and predecessors are materialized in
implementation of Chord.
tables, and pending join events are also stored ephemerally.
Lookups for key K seek the node whose identifier is the
A Chord node also holds a finger table, pointing at peers
immediate successor on the ring of K:
whose ID distances exponentially increase from itself:
L1 lookupResults@R(R,K,S,SI,E) :- node@NI(NI,N),
F1 fFix@NI(NI,E,I) :- periodic@NI(NI,E,tFix),
lookup@NI(NI,K,R,E), bestSucc@NI(NI,S,SI), K in
range(I,0,fNum-1), f_coinFlip(fFixProb).
(N,S].
F2 lookup@NI(NI,K,NI,E) :- fFix@NI(NI,E,I),
L2 bestLookupDist@NI(NI,K,R,E,min<D>) :-
node(NI,N), K:=N+1<<I.
node@NI(NI,N), lookup@NI(NI,K,R,E),
F3 finger@NI(NI,I,B,BI) :- fFix@NI(NI,E,I),
finger@NI(NI,_,B,_), D:=K - B - 1, B in (N,K).
lookupResults@NI(NI,K,B,BI,E), K in (N+1<<I,N),
L3 lookup@BI(min<BI>,K,R,E) :- node@NI(NI,N),
node@NI(NI,N).
bestLookupDist@NI(NI,K,R,E,D),
finger@NI(NI,_,B,BI), D == K - B - 1, B in (N,K).
At time intervals of tFix seconds, a “fix-finger” event is
triggered for the I-th finger, in rule F1, where I is in the
Rule L1 returns a successful lookup result if the received
range of valid finger entry indices. F1 chooses whether to
lookup seeks a key K found between the receiving node’s
fix each finger entry according to a probability distribution;
identifier and that of its best successor (we come back to
it could, instead, be made to pick indices in order (by re-
the best successor below). In parallel, rule L2 finds the min-
placing f coinFlip with a counter equated to I) or can be
imum distance from the local node’s fingers to K, for every
made to pick all indices, by removing f coinFlip altogether.
finger node BI whose identifier B lies between the local node’s
F2 issues a Chord lookup for the identifier that is 2I away
identifier N and K. The first address of a finger with that
from the local node’s identifier. If a successor of that iden-
minimum distance to the key K is chosen in L3 to receive the
9
forwarded lookup. The condition B in (N,K) ensures that
4.1
Discussion: OverLog
either L1 or L3 produces a result, not both.
The above 15 rules are sufficient to specify a Chord overlay
In a Chord configuration such as MACEDON’s where only
that is reasonably faithful to the original Chord proposal.
a single successor is maintained, a “best successor” can be
Having now presented versions of both Narada meshes and
defined simply as:
Chord, it is reasonable to ask how good a fit the OverLog
language is for overlay specification.
N1 bestSucc@NI(NI,S,SI) :- succ@NI(NI,S,SI).
OverLog succeeds in its goal of conciseness. It can cap-
However, replacing this rule with the following two allows
ture a complex overlay like Chord remarkably succinctly, in
our Chord specification the maintenance of more successors
a form that can be automatically translated to a dataflow
for resilience of the overlay (a typical value is log n, where
framework. Moreover, this process is amenable to a num-
n is the estimated size of the network):
ber of optimization techniques and presents several oppor-
tunities for sharing state, communication, and computation
N1 bestSuccDist@NI(NI,min<D>) :- node@NI(NI,N),
across multiple overlay specifications.
succ@NI(NI,S,_), D := S - N - 1.
On the other hand, while succinct, OverLog has a non-
N2 bestSucc@NI(NI,S,SI) :- succ@NI(NI,S,SI),
trivial learning curve for programmers used to imperative
bestSuccDist@NI(NI,D), node@NI(NI,N),
languages like Java and C. One example issue is the trick-
D == S - N - 1.
iness of coding up the equivalent of if-then-else logic. For
N1 and N2 define as “best” the successor among those stored
example, consider the lookup rules above that treat either
in the succ stored table whose identifier distance from the
a locally-satisfied lookup (L1) or forwarding (L2 and L3).
current node’s identifier is the lowest. Candidate successors
The order of evaluating the three rules is unspecified; the
(and predecessor) are supplied during the stabilization phase
two cases are essentially tested “in parallel.” Hence it is
of the Chord overlay maintenance. In OverLog, one of the
important in specifying the bodies of these rules that only
several Chord stabilization activities looks as follows:
one or the other case of lookup logic can be satisfied under
any circumstances. This has minimal performance impact
SB5 sendSuccessors@SI(SI,NI) :- stabilize@NI(NI,_),
since the common rule components can easily be compiled
succ@NI(NI,_,SI).
to share a dataflow subgraph – indeed, it may have signifi-
SB6 succ@PI(PI,S,SI) :- sendSuccessors@NI(NI,PI),
cant performance benefits in a multiprocessor environment,
succ@NI(NI,S,SI).
an issue we plan to investigate in the future. That said, the
In SB5, a node asks all of its successors to send it their own
programming style is unusual. Some syntactic sugar would
successors, whenever the stabilize event is issued. SB6 in-
help with this issue, but it highlights the kinds of unconven-
stalls at the original node the returned successor. Successor
tional thinking that arise for the declarative programmer.
selection, not shown here, only keeps those successors clos-
Similarly, we are sensitive to the fact that the syntax of
est to a node in the table, evicting the remaining nodes.
Datalog – and hence OverLog – is an acquired taste. Our
Due to space constraints, we limit ourselves here to this
Chord specification is about as long as the pseudocode in the
15-rule OverLog specification of a simplistic but functional
original paper, but harder for most humans to read. Obvi-
version of the Chord overlay. However, a fuller specification
ously the ability of P2 to execute our specification gives it
including all Chord stabilization activities, explicit succes-
a qualitative advantage over the Chord pseudocode, but a
sor evictions, and connectivity monitoring for fault tolerance
different syntax might make it more readable to systems
requires no more than 47 OverLog rules, which can be in-
programmers. The success of SQL is suggestive here – it
put to the P2 parser directly, producing an automatically
is easier for most people to read simple SQL queries than
generated running Chord implementation. This version is
their Datalog equivalents. However, SQL’s support for re-
available in Appendix B.
cursive rules is awkward, and recursion is key to networking
Figure 2 illustrates the generated dataflow that captures
protocols because of their reliance on graph transitivity –
the lookup rules L1, L2, and L3, following the notational
e.g., constructing multi-hop paths as with Chord lookups
conventions of Click.
The dataflow is made up of three
out of single-hop links. Some middle-ground syntax may be
parts. First, the three rules are translated into the shaded
appropriate for our purposes.
dataflows in the middle of the diagram. Below them lie the
More fundamentally, it is less clear that OverLog achieves
simple dataflows that store into tables any received store-
the goals set out in Section 2 for logical structure of the
able tuples node, bestSucc and finger.
specification. Our Chord rules have a somewhat operational
Input to these flows comes from a large demultiplexer on
flavor: rather than declaratively stating the invariants of
the bottom left of the figure, which classifies its push input
the Chord data structure – a ring with fingers at increasing
according to its tuple name, and forwards it to the appro-
powers of two – our specification has rules for the activities
priate rule. Note that lookup tuples are duplicated by the
of “joining” the overlay and “fixing fingers.” In this respect,
“Dup” element, since they appear on the right hand sides
of course, it closely follows the Chord pseudocode, and the
of both L1 and L2. On the other side of the graph, output
pseudocode itself is operational. It could be argued that our
tuples are merged by the round-robin pull scheduler into
Chord description is still reasonably declarative in nature,
a single stream, which is then demultiplexed according to
but the key observation is that OverLog neither encourages
tuples’ network destination (the “@” notation of OverLog).
nor requires structural specification of the overlay, and this
Remote tuples are sent via an output queue to the network
is a consequence of the language’s generality.
stack to be packetized, marshaled, and buffered by our UDP
Based on our experience so far, we are designing a re-
transport, while tuples destined for local consumption are
placement for OverLog that has a more readable syntax,
wrapped around to the left of the figure and queued along
and frames the overlay design process more concretely as
with other input tuples arriving over the network.
two separate specifications: structural invariants, and the
10
(i)
(ii)
(iii)
0.18
350
1
0.16
100
300
300
0.14
500
0.8
250
0.12
0.1
200
0.6
0.08
150
CDF
0.4
Frequency 0.06
100
100
0.04
0.2
300
0.02
50
500
0
Maintenance BW (Bytes/s)
0
0
0
2
4
6
8 10 12 14 16
100
200
300
400
500
0
2
4
6
8
10
Hop Count
Population Size
Latency (s)
Figure 3: Performance of static Chord networks of different sizes.
(i) shows hop-count distribution for
lookups, (ii) shows maintenance traffic in the absence of churn for different population sizes; and (iii) shows
measured cumulative distribution of the latency over all lookups.
dynamics of maintenance rules.
implementation. While the latency increases with network
size (as might be expected), on a 500-node static network
5.
EVALUATION
96% of all lookups complete in 6 seconds or less. Our la-
tency numbers are within the same order of magnitude as
In this section we present performance results from P2.
the published numbers [34] of the MIT Chord deployment.
We have two goals in this evaluation.
First, we wish to
validate that our simple declarative specifications of complex
5.2
Handling Churn
overlays result in the expected network properties, including
In our second round of experiments, we focus on the per-
topological properties and messaging overheads. Second, we
formance of our Chord implementation under varying de-
examine our raw performance with an eye toward feasibility:
grees of membership churn. Again, our goal is to validate
we do not expect our per-node performance to be as good
that our compact specification of Chord faithfully captures
as a highly-tuned hand-coded implementation, but we would
its salient properties. We bring up a 400 node Chord net-
like it to be acceptable.
work, and then churn the network for 20 minutes, following
In our experiments, we focus on the full Chord DHT spec-
the methodology in [29].
ification in Appendix B. We chose to present Chord results
Figure 4 shows our results. The first graph (i) measures
largely because it is a good stress test of our architecture,
the “maintenance traffic” for the network, that is, all traffic
being relatively complex compared to other examples like
not associated with lookups and responses. This includes
gossip and end-system multicast. Chord also has the advan-
traffic for checking the status of nodes in the network, and
tage of being well-studied.
recovery traffic as the nodes come and go. As in the static
We deployed P2 on the Emulab testbed [38], with in-
case, our maintenance traffic is fairly respectable.
stances of P2 spread over a network of 100 machines. Ex-
Figure 4(ii) examines consistency of lookups (following
cept where noted, the network is a GT-ITM transit-stub
the experimental setup of Bamboo [29]), and Figure 4(iii)
topology with 10 transit domains and 100 stubs (one per
considers raw latency of external (i.e., non-maintenance-
physical node) emulated using DummyNet on 10 additional
related) lookups under churn. P2 Chord does respectably
machines. The RTT between two transit nodes is 50ms,
under low churn (session times of 64 minutes and above),
while the RTT between two nodes in the same domain is
generating at least 97% consistent lookups, most of which
2ms. The link capacity is set to 100Mbps and 10Mbps for
complete within 4 seconds. On the other hand, under high
domain and stub nodes respectively.
churn (session times of 16 minutes and less), P2 Chord does
5.1
Feasibility Experiments
not perform well, producing only 42% and 84% consistent
lookups with high latencies.
We begin with the high-level characteristics of the Chord
Ultimately, an evaluation of a system like P2 rests on an
overlay, to validate that the P2 implementation exhibits the
assessment of the ideal tradeoff between code size and per-
expected properties. We generate a uniform workload of
formance. Our current Chord overlay written in OverLog
DHT “lookup” requests to a static membership of nodes
performs acceptably, but clearly does not attain the pub-
in the overlay, with no nodes joining or leaving. This is
lished figures for the MIT implementation (at least 99.9%
somewhat unrealistic but it allows us to observe the static
consistency for a session time of 47 minutes, and mean lookup
properties of Chord.
latency of less than 5 seconds under high churn).
Figure 3(i) shows the hop-count distribution for our work-
We conjecture that a carefully crafted and complete MACE-
load. As expected, the hop-count averages log N/2, where
N
DON implementation of Chord might outperform our 47
is the size of the network. Figure 3(ii) shows the main-
OverLog rules, but it would be more than an order of mag-
tenance traffic for the network; this is the bandwidth used
nitude more complex. We note that the 320-line version
by the nodes while idling. As can be seen, our overlay is
supplied with the distribution4 is not sufficiently complete
configured to use relatively low bandwidth – networks aim-
to evaluate under churn or compare meaningfully to MIT
ing at high consistency and low latency typically use about
1 KByte/s per node [29].
4See chord.mac in MACEDON release 1.2.1-20050531, from
Figure 3(iii) shows the raw latency performance of our
http://macedon.ucsd.edu/release/.
11
(i)
(ii)
(iii)
500
450
1
8 min
1
400
16 min
0.8
0.8
350
32 min
300
64 min
0.6
128 min
0.6
250
CDF
CDF
8 min
200
0.4
0.4
16 min
150
32 min
100
0.2
0.2
64 min
50
Consistency threshold
128 min
Maintenance BW (Bytes/s)
0
0
0
8
16
32
64
128
0
0.2
0.4
0.6
0.8
1
0
2
4
6
8 10 12 14 16 18 20
Session Time (min)
Consistent fraction
Latency (s)
Figure 4: Performance of a 400-node Chord overlay under varying degrees of churn (8, 16, 32, 64, and 128
minute session times). (i) shows maintenance traffic over time during our experiment; (ii) shows consistency
of lookups over time, and (iii) shows lookup latency under churn. Each is a cumulative distribution over all
lookups.
Chord or our implementation over P2; for example, MACE-
ments to C++. Because the output of the MACEDON com-
DON’s Chord implements 32-bit node identifiers instead of
piler closely mirrors the structure of the code that a skilled
the traditional 160 bits, and provides only a single successor
programmer would produce for an overlay, we believe that
for each node, making it highly likely that the ring becomes
the performance of a well-tuned MACEDON network could
partitioned.
approach a custom implementation with less code than C++
would require, though the current MACEDON overlay suite
6.
RELATED WORK
does not approach this.
An interesting alternative to state machines is RTAG [3],
The design of P2 can be viewed as a synthesis of ideas from
where the protocol is expressed as a grammar. Incoming
database systems, particularly recent work in distributed
messages and events are modeled as tokens causing reduc-
and continuous query processing, with logic languages, and
tions in grammar rules, and the state of a connection is held
the result applied to overlay networks. Throughout this pa-
on the parser stack rather than encapsulated in an FSM.
per we have highlighted the database systems techniques we
The i3 [33] infrastructure offers a rendezvous-based ab-
have employed in P2; in this section we situate OverLog and
straction that provides significant flexibility in specifying
P2’s approach in the context of existing work on generating
communication structures and patterns. i3 is similar in fun-
overlays from protocol descriptions.
damental ways to the relational abstraction used in P2– the
There is a long tradition of automating the generation of
decoupling of senders and receivers via keys in i3 is similar to
protocol implementations from specifications. Much of the
the keys and foreign keys of relational models, and the same
early work focuses on expressing OSI protocols in a finite
flexible indirections are possible in both. However, i3 is tar-
state machine language (Esterel, Estelle, LOTOS, SDL [5,
geted as a fundamental communication abstraction, whereas
35], etc.), and compiling them into CPU-efficient implemen-
P2 is a more functional but arguably more special-purpose
tations (e.g., [9, 37]). The focus of this body of work is on
system.
supporting formal protocol specification (e.g., for verifica-
Although Click’s configuration language unambiguously
tion), ideally without sacrificing CPU performance.
specifies the dataflow elements and graph to be generated,
More recently, there has been increased emphasis on read-
the idea of using a high-level logic language for describ-
ability and code reuse for network protocol specification lan-
ing network protocols seems relatively new. Loo et al. [21]
guages. This includes domain-specific object-oriented lan-
recently proposed performing IP routing using declarative
guages like Morpheus [1] and Prolac [17], and functional
queries, also written in a variant of Datalog. Our imple-
approaches like the Fox project’s TCP implementation [6].
mentation of P2 is focused on overlay construction rather
It is typical to think of implementing network protocols
than IP routing, but as we discussed in Section 3.5, some
in terms of state machines, whose transitions are triggered
of the optimization techniques suggested in [21] from the
by timer events or the arrival of messages (either over the
deductive database literature are applicable in P2.
network or from a local client application). This is the ap-
As in [21], by representing the desired network properties
proach taken in essentially all of the work cited above. Net-
in OverLog at a higher level of abstraction than a dataflow
work protocol stacks are typically implemented in such a
graph, protocol grammar, or FSM description, P2 achieves
manner even when hand-coded, and this approach lends it-
very concise descriptions that can nevertheless generate ex-
self to object-oriented programming languages, where finite
ecutable dataflow graphs to maintain the overlay.
state machines (FSMs) are encapsulated in software objects.
In addition to conciseness, as section 2.2 discusses, a top-
Overlays built using an FSM approach are generally event-
down approach like OverLog offers more opportunities for
driven from a Unix “select” loop or equivalent OS function-
compile- and run-time optimization of overlay descriptions,
ality, and can be highly efficient in terms of resource usage
and OverLog’s decomposition of state into tables and flows
(CPU, etc.) on a node. A recent example of this approach
provides more natural opportunities for code reuse and run-
is MACEDON [30], which adopts the FSM approach by en-
time sharing.
capsulating the event loop, timers, finite state machines, and
message formats, and compiling the resulting syntactic ele-
12
7.
CONCLUSION AND FUTURE WORK
addressed by a monolithic transport solution. P2’s dataflow
In this paper, we have investigated the feasibility of ex-
architecture makes such choices as simple as rearranging the
pressing overlay networks in a declarative language, and
bindings between a few common elements.
then directly executing the resulting specification to con-
Similarly, making the transport “layer” a graph of dataflow
struct and maintain the overlay network.
elements means that different bits of functionality can be
The approach looks promising: overlays can be specified
broken up and spread out throughout the application logic.
extremely concisely, yet with enough detail to be executed
For instance, one can push transmission retries to happen
with performance and robustness acceptable for many ap-
upstream of route selection, to allow nodes route flexibil-
plications.
Furthermore, OverLog makes it easy to alter
ity when the next hop is not the ultimate destination for a
overlay routing algorithms without reimplementing complex
transmission. Or, an application can push a transmit buffer
state machines.
upstream, near where tuples are first produced, reducing
Our current system successfully compiles and runs speci-
the amount of queuing that tuples encounter outside of P2,
fications for all overlays discussed in this paper. Our plan-
which minimizes the time during which they are unavailable
ner does not currently handle directly some of the more in-
for computation. Instead, tuples are available for compu-
volved constructs of the OverLog language, such as multi-
tation as soon as possible after they arrive on a node, get-
node rule bodies and the logic of negation; slightly wordier
ting pulled out of a table, routed, marshaled, and sent only
rule rewrites allow us to circumvent these limitations, as the
when the corresponding socket can send data, the conges-
next design iteration of the planner takes shape. The ap-
tion window has room, and the outgoing packet scheduler
pendices contain overlay specifications that are executable
has selected the tuple’s dataflow.
by our system as of July 2005.
On-line distributed debugging: A unique opportunity
We continue our efforts in several directions.
offered by our system is its multi-resolution programming
Breadth: In the short term, we are working on coding
paradigm, allowing the programmer to specify a system as a
a variety of other overlay networks in OverLog: epidemic-
set of logical rules, which are translated to a dataflow graph
based networks, link-state- and path-vector-based overlays,
and then executed at runtime. By exposing the execution
and further DHT schemes. This has the benefit of exercising
history of the system – in terms of rules that fired at a first
the P2 planner and ensuring that OverLog is sufficiently
level, or in terms of actions taken by each dataflow element
expressive to cover the design space, but will also enable us
at a second level – P2 is astonishingly capable to provide
to start to identify common constructs that can be factored
introspection support for on-line overlay debugging. In fact,
out of particular overlay specifications and shared.
since execution history at any resolution can be exported
Sharing: Sharing is intriguing not only in terms of code
as a set of relational tables, much as everything else with
reuse, but also for the possibility that multiple overlays can
P2, debugging itself becomes an exercise in writing OverLog
execute simultaneously, sharing state, communication, and
rules for high-level invariants or for execution history min-
computation by sharing dataflow subgraphs. Sharing be-
ing. We are actively building up the infrastructure within
tween multiple overlays can allow a single application to
P2 to evaluate the exciting debugging possibilities that such
achieve different performance goals for different tasks, by ef-
support would enable.
ficiently deploying multiple overlay variants simultaneously.
Security: It is a natural, short step (though perhaps a
For example, a peer-to-peer content distribution network
precarious one) to move from system introspection to secu-
that combines search with parallel download might choose
rity assessment. P2’s runtime composition of small, simple
to construct two different overlays for these very different
dataflow elements along with flat state tables implies that
tasks. These overlays might fruitfully share rules for liveness
we might be able to express security invariants for each el-
checking, latency and bandwidth estimation, etc. Runtime
ement in isolation, which would ideally compose into global
sharing across overlays can also allow separately deployed
security properties for the whole system. We are exploring
systems to co-exist efficiently within a shared network infras-
the application of our earlier work on preserving the his-
tructure; this might become important if overlays emerge as
toric integrity of a distributed system [22] to the problem of
a prevalent usage model for the Internet.
making the execution of a complex overlay tamper-evident.
The na¨ıve approach for sharing is to do so explicitly at the
Language: The version of OverLog that we describe in
OverLog level, by sharing rule specifications. However, we
this work is a strawman vehicle, expressive enough to aid
hope to apply multiquery optimization techniques from the
us in our initial exploration of the design space. We have
database literature to identify further sharing opportunities
produced a preliminary formal semantics for OverLog, to
automatically, within the P2 planner. This enhancement to
enable reasoning about program properties (safety, termina-
the planner will be in tandem with applying some of the
tion, etc.). We are also exploring the extensions in the lan-
single query optimization techniques we mentioned in sec-
guage that would permit a high-level specification of sharing,
tion 3.5.
transport, debugging, and security logic, as described above.
Transport Protocols: While bringing different overlays
We expect to revisit the choice of Datalog as a basis for
to P2, we have rediscovered the multifaceted nature of over-
OverLog. As we have discussed, Datalog’s generality makes
lay requirements on network and transport facilities. Differ-
it an ideal choice as a “first cut” declarative language for
ent applications require different combinations of reliable,
overlays. How OverLog can be improved by tailoring its
in-order, congestion-, and flow-controlled transports, and
syntax and semantics more specifically towards overlay de-
different levels of control and inspection for each. For exam-
scription is an interesting research direction.
ple, going from applying TCP-friendly congestion control on
a per-peer basis (as in Bamboo [29]) to having a single con-
8.
ACKNOWLEDGMENTS
gestion window for the whole node (as in MIT Chord [20]) is
We would like to thank our shepherd Dahlia Malkhi, Brent
a matter of design requirements that may not be sufficiently
Chun for initial encouragement and help using his distributed
13
automated testing tool for our evaluation, David Gay for
Comput. Syst., 18(3):263–297, 2000.
his significant contributions to OverLog’s operational se-
[19] F. T. Leighton. Introduction to Parallel Algorithms and
mantics, and Tristan Koo for testing code and performance
Architectures: Arrays, Trees, Hypercubes. Morgan
microbenchmarks. We are also indebted to Brent, David,
Kaufmann, San Mateo, CA, 1992.
Ryan Huebsch, Kevin Lai, Raghu Ramakrishnan, Sylvia
[20] J. Li, J. Stribling, T. Gil, R. Morris, and F. Kaashoek.
Comparing the performance of distributed hash tables
Ratnasamy, and Sean Rhea, for their thoughtful comments
under churn. In Proc. IPTPS, 2004.
on drafts of this paper. Finally, our paper has benefitted sig-
[21] B. T. Loo, J. M. Hellerstein, and I. Stoica. Customizable
nificantly from the detailed feedback offered by the anony-
routing with declarative queries. In Third Workshop on
mous reviewers.
Hot Topics in Networks (HotNets-III), Nov. 2004.
[22] P. Maniatis. Historic Integrity in Distributed Systems. PhD
9.
REFERENCES
thesis, Computer Science Department, Stanford University,
Stanford, CA, USA, Aug. 2003.
[1] M. B. Abbott and L. L. Peterson. A language-based
[23] G. Manku, M. Bawa, and P. Raghavan. Symphony:
approach to protocol implementation. IEEE/ACM
Distributed hashing in a small world. In Proc. USITS, 2003.
Transactions on Networking, 1(1), Feb. 1993.
[24] D. Mazi`
eres. A toolkit for user-level file systems. In Proc.
[2] S. Abiteboul, R. Hull, and V. Vianu. Foundations of
of the 2001 USENIX Technical Conference, June 2001.
Databases. Addison Wesley, 1995.
[25] D. Mosberger and L. L. Peterson. Making paths explicit in
[3] D. P. Anderson. Automated protocol implementation with
the Scout operating system. In Proc. OSDI, pages 153–167.
RTAG. IEEE Trans. Softw. Eng., 14(3):291–300, 1988.
ACM Press, 1996.
[4] H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel,
[26] R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu,
M. Cherniack, C. Convey, E. Galvez, J. Salz,
M. Datar, G. S. Manku, C. Olston, J. Rosenstein, and
M. Stonebraker, N. Tatbul, R. Tibbetts, and S. Zdonik.
R. Varma. Query processing, approximation, and resource
Retrospective on Aurora. VLDB Journal, 13(4), Dec. 2004.
management in a data stream management system. In
[5] G. Berry. The Foundations of Esterel, pages 425–454. MIT
Proc. CIDR, 2003.
Press, 1998.
[27] G. M. Papadopoulos and D. E. Culler. Monsoon: An
[6] E. Biagioni. A structured TCP in Standard ML. In Proc.
explicit token store architecture. In Proc. ISCA, May 1990.
SIGCOMM, 1994.
[28] V. Raman, A. Deshpande, and J. M. Hellerstein. Using
[7] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J.
state modules for adaptive query processing. In Proc.
Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy,
ICDE, 2003.
S. Madden, V. Raman, F. Reiss, and M. A. Shah.
[29] S. Rhea, D. Geels, T. Roscoe, and J. Kubiatowicz.
TelegraphCQ: Continuous dataflow processing for an
Handling Churn in a DHT. In Proc. of the 2004 USENIX
uncertain world. In CIDR, 2003.
Technical Conference, Boston, MA, USA, June 2004.
[8] Y.-H. Chu, S. G. Rao, and H. Zhang. A case for end system
[30] A. Rodriguez, C. Killian, S. Bhat, D. Kostic, and
multicast. In Proc. of ACM SIGMETRICS, pages 1–12,
A. Vahdat. MACEDON: Methodology for Automatically
2000.
Creating, Evaluating, and Designing Overlay Networks”,.
[9] W. Dabbous, S. W. O’Malley, and C. Castelluccia.
In Proc. NSDI, March 2004.
Generating efficient protocol code from an abstract
[31] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A.
specification. In Proc. SIGCOMM, pages 60–72, 1996.
Lorie, and T. G. Price. Access path selection in a relational
[10] F. Dabek, J. Li, E. Sit, F. Kaashoek, R. Morris, and
database management system. In SIGMOD Conference,
C. Blake. Designing a DHT for low latency and high
pages 23–34, 1979.
throughput. In Proc. NSDI, Month 2004.
[32] T. Sellis. Multiple Query Optimization. ACM Transactions
[11] S. Deering and D. R. Cheriton. Multicast routing in
on Database Systems, 13(1):23–52, Mar. 1988.
datagram internetworks and extended LANs. ACM
[33] I. Stoica, D. Adkins, S. Zhaung, S. Shenker, and S. Surana.
Transactions on Computer Systems, 8(2):85–111, May
Internet indirection infrastructure. IEEE/ACM
1990.
Transactions on Networking, (2), Apr. 2004.
[12] D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens,
[34] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F.
K. B. Kumar, and M. Muralikrishna. Gamma - a high
Kaashoek, F. Dabek, and H. Balakrishnan. Chord: a
performance dataflow database machine. In VLDB, pages
scalable peer-to-peer lookup protocol for internet
228–237, 1986.
applications. IEEE/ACM Trans. Netw., 11(1):17–32, 2003.
[13] M. Fecko, M. Uyar, P. Amer, A. Sethi, T. Dzik, R. Menell,
[35] K. J. Turner, editor. Using formal description techniques –
and M. McMahon. A success story of formal description
An Introduction to Estelle, LOTOS and SDL. Wiley, 1993.
techniques: Estelle specification and test generation for
MIL-STD 188-220. Computer Communications (Special
[36] A. H. Veen. Dataflow machine architecture. ACM
Edition on FDTs in Practice), 23, 2000.
Computing Surveys, 18(4), Dec. 1986.
[14] G. Graefe. Encapsulation of parallelism in the Volcano
[37] S. T. Vuong, A. C. Lau, and R. I. Chan. Semiautomatic
query processing system. In Proc. of the 1990 ACM
implementation of protocols using an Estelle-C compiler.
SIGMOD International Conference on Management of
IEEE Transactions on Software Engineering, 14(3), Mar.
Data, Atlantic City, NJ, May 23-25, 1990, pages 102–111.
1988.
ACM Press, 1990.
[38] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad,
[15] M. Handley, A. Ghosh, P. Radoslavov, O. Hodson, and
M. Newbold, M. Hibler, C. Barb, and A. Joglekar. An
E. Kohler. Designing extensible IP router software. In Proc.
integrated experimental environment for distributed
NSDI, May 2005.
systems and networks. In Proc. OSDI 2002, Boston, MA,
Dec. 2002.
[16] R. Huebsch, B. N. Chun, J. M. Hellerstein, B. T. Loo,
P. Maniatis, T. Roscoe, S. Shenker, I. Stoica, and A. R.
Yumerefendi. The architecture of PIER: an Internet-scale
query processor. In CIDR, pages 28–43, 2005.
[17] E. Kohler, M. F. Kaashoek, and D. R. Montgomery. A
readable TCP in the Prolac protocol language. In Proc.
SIGCOMM, 1999.
[18] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F.
Kaashoek. The Click modular router. ACM Trans.
14
APPENDIX
A.
NARADA IN OverLog
/** If I have none, just store what I got */
Here we provide an executable OverLog implementation
R6 member@X(X, Address, ASequence, T, ALive) :-
of Narada’s mesh maintenance algorithms. Current limita-
membersFound@X(X, Address, ASequence, ALive, C),
tions of the P2 parser and planner require slightly wordier
C == 0, T := f_now().
syntax for some of our constructs. Specifically, handling of
negation is still incomplete, requiring that we rewrite some
rules to eliminate negation. Furthermore, our planner cur-
/** If I have some, just update with the
rently handles rules with collocated terms only. The Over-
information I received if it has a higher
sequence number. */
Log specification below is directly parsed and executed by
our current codebase.
R7 member@X(X, Address, ASequence, T, ALive) :-
membersFound@X(X, Address, ASequence, ALive, C),
/** Base tables */
C > 0, T := f_now(), member@X(X, Address,
MySequence, MyT, MyLive), MySequence < ASequence.
materialize(member, infinity, infinity, keys(2)).
materialize(sequence, infinity, 1, keys(2)).
materialize(neighbor, infinity, infinity, keys(2)).
/** Update my neighbor’s member entry */
R8 member@X(X, Y, YSeq, T, YLive) :- refresh@X(X,
/* Environment table containing configuration
Y, YSeq, A, AS, AL), T := f_now(), YLive := 1.
values */
materialize(env, infinity, infinity, keys(2,3)).
/** Add anyone from whom I receive a refresh
message to my neighbors */
/* Setup of configuration values */
N1 neighbor@X(X, Y) :- refresh@X(X, Y,
YS, A, AS, L).
E0 neighbor@X(X,Y) :- periodic@X(X,E,0,1), env@X(X,
H, Y), H == "neighbor".
/** Probing of neighbor liveness */
/** Start with sequence number 0 */
L1 neighborProbe@X(X) :- periodic@X(X, E, 1).
L2 deadNeighbor@X(X, Y) :- neighborProbe@X(X), T :=
S0 sequence@X(X, Sequence) :- periodic@X(X, E, 0,
f_now(), neighbor@X(X, Y), member@X(X, Y, YS, YT,
1), Sequence := 0.
L), T - YT > 20.
L3 delete neighbor@X(X, Y) :- deadNeighbor@X(X, Y).
L4 member@X(X, Neighbor, DeadSequence, T, Live) :-
/** Periodically start a refresh */
deadNeighbor@X(X, Neighbor), member@X(X,
Neighbor, S, T1, L), Live := 0, DeadSequence := S
R1 refreshEvent@X(X) :- periodic@X(X, E, 3).
+ 1, T:= f_now().
/** Increment my own sequence number */
B.
CHORD IN OverLog
Here we provide the full OverLog specification for Chord.
R2 refreshSequence@X(X, NewSequence) :-
This specification deals with lookups, ring maintenance with
refreshEvent@X(X), sequence@X(X, Sequence),
a fixed number of successors, finger-table maintenance and
NewSequence := Sequence + 1.
opportunistic finger table population, joins, stabilization,
and node failure detection.
/** Save my incremented sequence */
/* The base tuples */
R3 sequence@X(X, NewSequence) :-
refreshSequence@X(X, NewSequence).
materialize(node, infinity, 1, keys(1)).
materialize(finger, 180, 160, keys(2)).
materialize(bestSucc, infinity, 1, keys(1)).
/** Send a refresh to all neighbors with my current
materialize(succDist, 10, 100, keys(2)).
membership */
materialize(succ, 10, 100, keys(2)).
materialize(pred, infinity, 100, keys(1)).
R4 refresh@Y(Y, X, NewSequence, Address, ASequence,
materialize(succCount, infinity, 1, keys(1)).
ALive) :- refreshSequence@X(X, NewSequence),
materialize(join, 10, 5, keys(1)).
member@X(X, Address, ASequence, Time, ALive),
materialize(landmark, infinity, 1, keys(1)).
neighbor@X(X, Y).
materialize(fFix, infinity, 160, keys(2)).
materialize(nextFingerFix, infinity, 1, keys(1)).
materialize(pingNode, 10, infinity, keys(2)).
/** How many member entries that match the member
materialize(pendingPing, 10, infinity, keys(2)).
in a refresh message (but not myself) do I have? */
R5 membersFound@X(X, Address, ASeq, ALive,
/** Lookups */
count<*>) :- refresh@X(X, Y, YSeq, Address, ASeq,
ALive), member@X(X, Address, MySeq, MyTime,
L1 lookupResults@R(R,K,S,SI,E) :- node@NI(NI,N),
MyLive), X != Address.
lookup@NI(NI,K,R,E), bestSucc@NI(NI,S,SI), K in
15
(N,S].
SB1 stabilize@NI(NI,E) :- periodic@NI(NI,E,15).
L2 bestLookupDist@NI(NI,K,R,E,min<D>) :-
SB2 stabilizeRequest@SI(SI,NI) :-
node@NI(NI,N), lookup@NI(NI,K,R,E),
stabilize@NI(NI,E), bestSucc@NI(NI,S,SI).
finger@NI(NI,I,B,BI), D:=K - B - 1, B in (N,K).
SB3 sendPredecessor@PI1(PI1,P,PI) :-
L3 lookup@BI(min<BI>,K,R,E) :- node@NI(NI,N),
stabilizeRequest@NI(NI,PI1), pred@NI(NI,P,PI), PI
bestLookupDist@NI(NI,K,R,E,D),
!= "-".
finger@NI(NI,I,B,BI), D == K - B - 1, B in (N,K).
SB4 succ@NI(NI,P,PI) :- node@NI(NI,N),
sendPredecessor@NI(NI,P,PI),
bestSucc@NI(NI,S,SI), P in (N,S).
/** Neighbor Selection */
SB5 sendSuccessors@SI(SI,NI) :- stabilize@NI(NI,E),
succ@NI(NI,S,SI).
N1 succEvent@NI(NI,S,SI) :- succ@NI(NI,S,SI).
SB6 returnSuccessor@PI(PI,S,SI) :-
N2 succDist@NI(NI,S,D) :- node@NI(NI,N),
sendSuccessors@NI(NI,PI), succ@NI(NI,S,SI).
succEvent@NI(NI,S,SI), D:=S - N - 1.
SB7 succ@NI(NI,S,SI) :-
N3 bestSuccDist@NI(NI,min<D>) :-
returnSuccessor@NI(NI,S,SI).
succDist@NI(NI,S,D).
SB7 notifyPredecessor@SI(SI,N,NI) :-
N4 bestSucc@NI(NI,S,SI) :- succ@NI(NI,S,SI),
stabilize@NI(NI,E), node@NI(NI,N),
bestSuccDist@NI(NI,D), node@NI(NI,N), D == S - N
succ@NI(NI,S,SI).
- 1.
SB8 pred@NI(NI,P,PI) :- node@NI(NI,N),
N5 finger@NI(NI,0,S,SI) :- bestSucc@NI(NI,S,SI).
notifyPredecessor@NI(NI,P,PI),
pred@NI(NI,P1,PI1), ((PI1 == "-") || (P in
(P1,N))).
/** Successor eviction */
S1 succCount@NI(NI,count<*>) :- succ@NI(NI,S,SI).
/** Connectivity Monitoring */
S2 evictSucc@NI(NI) :- succCount@NI(NI,C), C > 4.
S3 maxSuccDist@NI(NI,max<D>) :- succ@NI(NI,S,SI),
CM0 pingEvent@NI(NI,E) :- periodic@NI(NI,E,5).
node@NI(NI,N), evictSucc@NI(NI), D:=S - N - 1.
CM1 pendingPing@NI(NI,PI,E) :- pingEvent@NI(NI,E),
S4 delete succ@NI(NI,S,SI) :- node@NI(NI,N),
pingNode@NI(NI,PI).
succ@NI(NI,S,SI), maxSuccDist@NI(NI,D), D == S -
CM2 pingReq@PI(PI,NI,E) :- pendingPing@NI(NI,PI,E).
N - 1.
CM3 delete pendingPing@NI(NI,PI,E) :-
pingResp@NI(NI,PI,E).
/** Finger fixing */
CM4 pingResp@RI(RI,NI,E) :- pingReq@NI(NI,RI,E).
CM5 pingNode@NI(NI,SI) :- succ@NI(NI,S,SI), SI !=
F0 nextFingerFix@NI(NI, 0).
NI.
F1 fFix@NI(NI,E,I) :- periodic@NI(NI,E,10),
CM6 pingNode@NI(NI,PI) :- pred@NI(NI,P,PI), PI !=
nextFingerFix@NI(NI,I).
NI, PI != "-".
F2 fFixEvent@NI(NI,E,I) :- fFix@NI(NI,E,I).
CM7 succ@NI(NI,S,SI) :- succ@NI(NI,S,SI),
F3 lookup@NI(NI,K,NI,E) :- fFixEvent@NI(NI,E,I),
pingResp@NI(NI,SI,E).
node@NI(NI,N), K:=1I << I + N.
CM8 pred@NI(NI,P,PI) :- pred@NI(NI,P,PI),
F4 eagerFinger@NI(NI,I,B,BI) :- fFix@NI(NI,E,I),
pingResp@NI(NI,PI,E).
lookupResults@NI(NI,K,B,BI,E).
CM9 pred@NI(NI,"-","-") :- pingEvent@NI(NI,E),
F5 finger@NI(NI,I,B,BI) :-
pendingPing@NI(NI,PI,E), pred@NI(NI,P,PI).
eagerFinger@NI(NI,I,B,BI).
F6 eagerFinger@NI(NI,I,B,BI) :- node@NI(NI,N),
eagerFinger@NI(NI,I1,B,BI), I:=I1 + 1, K:=1I << I
+ N, K in (N,B), BI != NI.
F7 delete fFix@NI(NI,E,I1) :-
eagerFinger@NI(NI,I,B,BI), fFix@NI(NI,E,I1), I >
0, I1 == I - 1.
F8 nextFingerFix@NI(NI,0) :-
eagerFinger@NI(NI,I,B,BI), ((I == 159) || (BI ==
NI)).
F9 nextFingerFix@NI(NI,I) :- node@NI(NI,N),
eagerFinger@NI(NI,I1,B,BI), I:=I1 + 1, K:=1I << I
+ N, K in (B,N), NI != BI.
/** Churn Handling */
C1 joinEvent@NI(NI,E) :- join@NI(NI,E).
C2 joinReq@LI(LI,N,NI,E) :- joinEvent@NI(NI,E),
node@NI(NI,N), landmark@NI(NI,LI), LI != "-".
C3 succ@NI(NI,N,NI) :- landmark@NI(NI,LI),
joinEvent@NI(NI,E), node@NI(NI,N), LI == "-".
C4 lookup@LI(LI,N,NI,E) :- joinReq@LI(LI,N,NI,E).
C5 succ@NI(NI,S,SI) :- join@NI(NI,E),
lookupResults@NI(NI,K,S,SI,E).
/** Stabilization */
SB0 pred@NI(NI,"-","-").
16