A Java Fork/join Framework
A Java Fork/Join Framework
Doug Lea
State University of New York at
Oswego
Oswego NY 13126
315−341−2688
dl@cs.oswego.edu
ABSTRACT
Some associated programming techniques and examples are
discussed in section 4.4 of Concurrent Programming in Java,
This
paper
describes
the
design,
implementation,
and
second edition [7]. This paper discusses the design (section 2),
performance of
a Java framework for
supporting a style of
implementation (section 3), and performance (section 4) of
parallel programming in which problems are solved by
FJTask, a JavaTM framework that supports this programming
(recursively) splitting them into subtasks that are
solved in
style. FJTask is available as part of the util.concurrent
parallel, waiting for them to complete, and then composing
package from http://gee.cs.oswego.edu.
results. The general design is a variant of the work−stealing
framework
devised
for
Cilk.
The
main
implementation
2. DESIGN
techniques surround efficient construction and management of
Fork/join programs can be run using any framework that
tasks queues and worker threads. The measured performance
supports construction of subtasks that are executed in parallel,
shows good parallel speedups for most programs, but also
along with a mechanism for waiting out their completion.
suggests possible improvements.
However, the java.lang.Thread class (as well as POSIX
pthreads, upon which Java threads are often based) are
suboptimal vehicles for supporting fork/join programs:
1. INTRODUCTION
•
Fork/join tasks have simple and regular synchronization and
Fork/Join parallelism is among the simplest and most effective
management
requirements.
The
computation
graphs
design techniques for obtaining good parallel performance.
produced by fork/join tasks admit much more efficient
Fork/join algorithms are parallel versions of familiar divide−
scheduling tactics than needed for general−purpose threads.
and−conquer algorithms, taking the typical form:
For example, fork/join tasks never need to block except to
wait out subtasks. Thus, the overhead and bookkeeping
Result solve(Problem problem) {
necessary for tracking blocked general−purpose threads are
wasted.
if (problem is small)
•
directly solve problem
Given reasonable base task granularities, the cost of
constructing and managing a thread can be greater than the
else {
computation time of the task itself. While granularities can
split problem into independent parts
and should be subject to tuning when running programs on
fork new subtasks to solve each part
particular platforms, the extremely coarse granularities
join all subtasks
necessary to outweigh thread overhead limits opportunities
for exploiting parallelism.
compose result from subresults
In short, standard thread frameworks are just too heavy to
}
support most fork/join programs. But since threads form the
}
basis of
many other
styles of
concurrent
and parallel
programming as well, it is impossible (or at least impractical) to
The fork operation starts a new parallel fork/join subtask. The
remove overhead or tune scheduling of threads themselves just
join operation causes the current task not to proceed until the
for the sake of supporting this style.
forked subtask has completed. Fork/join algorithms, like other
While the ideas surely have a longer heritage, the first published
divide−and−conquer algorithms, are nearly always recursive,
framework to offer systematic solutions to these problems was
repeatedly splitting subtasks until they are small enough to solve
Cilk[5]. Cilk and other lightweight executable frameworks layer
using simple, short sequential methods.
special−purpose fork/join support on top of an operating
system’s basic thread or process mechanisms. This tactic applies
equally well to Java, even though Java threads are in turn
layered onto lower−level OS capabilities. The main advantage
of creating such a Java lightweight execution framework is to
enable fork/join programs to be written in a more portable
fashion and to run on the wide range of
systems supporting
JVMs.
The FJTask framework is based on a variant of the design
used in Cilk. Other variants are seen in Hood[4], Filaments[8],
stackthreads[10], and related systems relying on lightweight
As the standard example of how this framework appears to the
class ATask
extends FJTask {
programmer, here is a class computing the Fibonacci function:
public void run() {
class Fib extends FJTask {
split...
Task
static final int threshold = 13;
fork...
volatile int number; // arg/result
join...
Fib(int n) { number = n; }
compose...
}
int getAnswer() {
}
Task
if (!isDone())
throw new IllegalStateException();
return number;
}
public void run() {
int n = number;
Task
if (n <= threshold) // granularity ctl
number = seqFib(n);
else {
Fib f1 = new Fib(n − 1);
Fib f2 = new Fib(n − 2);
Worker
Worker
coInvoke(f1, f2);
number = f1.number + f2.number;
executable tasks. All of these frameworks map tasks to threads
}
}
in about the same way that operating systems map threads to
CPUs, but exploit the simplicity, regularity, and constraints of
public static void main(String[] args) {
fork/join programs in performing the mapping. While all of
try {
these frameworks can accommodate (to varying extents) parallel
int groupSize = 2; // for example
programs written in different styles, they optimize for fork/join
FJTaskRunnerGroup group =
designs:
new FJTaskRunnerGroup(groupSize);
Fib f = new Fib(35); // for example
•
A pool of worker threads is established. Each worker thread
group.invoke(f);
is a standard ("heavy") thread (here, an instance of Thread
int result = f.getAnswer();
subclass FJTaskRunner) that processes tasks held in
System.out.println("Answer: " +
result);
queues. Normally, there are as many worker threads as there
}
are CPUs on a system. In native frameworks such as Cilk,
catch (InterruptedException ex) {}
these are mapped to kernel threads or lightweight processes,
}
and in turn to CPUs. In Java, the JVM and OS must be
trusted to map these threads to CPUs. However, this is a
int seqFib(int n) {
very simple task for the OS, since these threads are
if (n <= 1) return n;
else return seqFib(n−1) + seqFib(n−2);
computationally intensive. Any reasonable mapping strategy
}
will map these threads to different CPUs.
}
•
All fork/join tasks are instances of a lightweight executable
This version runs at least thirty times faster than an equivalent
class, not instances of threads. In Java, independently
program
in
which
each
new
task
is
run
in
a
new
executable tasks must implement interface Runnable and
java.lang.Thread on the platforms described in section 4.
define a run method. In the FJTask framework, these
It does so while maintaining the intrinsic portability of
tasks subclass FJTask rather than subclassing Thread,
multithreaded Java programs. There are only two tuning
both of which implement Runnable. (In both cases, a class
parameters of typical interest to programmers:
can alternatively implement Runnable and then supply
•
instances to be run within executing tasks or threads.
The number of worker threads to construct, which should
Because tasks operate under restricted rules supported by
ordinarily correspond to the number of CPUs available on a
platform (or fewer, to reserve processing for other unrelated
FJTask methods, it is much more convenient to subclass
purposes,
or
occasionally
more,
to
soak
up
non−
FJTask, so as to be able to directly invoke them.)
computational slack).
•
A special purpose queuing and scheduling discipline is used
•
to manage tasks and execute them via the worker threads
A granularity parameter that represents the point at which
(see section 2.1). These mechanics are triggered by those
the overhead of generating tasks outweighs potential
few methods provided in the task class: principally
parallelism benefits. This parameter is typically more
fork,
algorithm−dependent
than
platform−dependent.
It
is
join, isDone (a completion status indicator), and some
convenience methods such as
normally possible to settle on a threshold that achieves good
coInvoke that forks then
joins two or more tasks.
results when run on a uniprocessor, yet still exploits
multiple CPUs when they are present. As a side−benefit, this
•
A
simple
control
and
management
facility
(here,
approach meshes well with JVM dynamic compilation
FJTaskRunnerGroup) sets up worker pools and initiates
mechanics
that
optimize
small
methods
better
than
execution of a given fork/join task when invoked from a
monolithic procedures. This, along with data locality
normal thread (such as the one performing main in a Java
advantages, can cause fork/join algorithms to outperform
program).
other kinds of algorithms even on uniprocessors.
2.1 Work−Stealing
It reduces contention by having stealers operate on the opposite
The heart of a fork/join framework lies in its lightweight
side of the deque as owners. It also exploits the property of
scheduling mechanics.
recursive divide−and−conquer algorithms of generating "large"
FJTask
adapts the basic tactics
pioneered in the Cilk work−stealing scheduler:
tasks early. Thus, an older stolen task is likely to provide a
larger unit of work, leading to further recursive decompositions
•
Each worker thread maintains runnable tasks in its own
by the stealing thread.
scheduling queue.
As one consequence of these rules, programs that employ
•
Queues are maintained as double−ended queues (i.e.,
relatively small task granularities for base actions tend to run
deques, usually pronounced "decks"), supporting both LIFO
faster than those that only use coarse−grained partitioning or
push and pop operations, as well as a FIFO
take
those that do not use recursive decomposition. Even though
operation.
relatively few tasks are stolen in most fork/join programs,
•
Subtasks generated in tasks run by a given worker thread
creating many fine−grained tasks means that a task is likely to
are pushed onto that workers own deque.
be available whenever a worker thread is ready to run it.
•
Worker
threads
process
their
own
deques
in
LIFO
3. IMPLEMENTATION
(youngest−first) order, by popping tasks.
The framework has been implemented in about 800 lines of pure
•
When a worker thread has no local tasks to run, it attempts
Java code, mainly in class FJTaskRunner, a subclass of
to take ("steal") a task from another randomly chosen
java.lang.Thread. FJTasks themselves maintain only a
worker, using a FIFO (oldest first) rule.
boolean completion status, and perform all other operations via
delegation
to
their
current
worker
threads.
The
•
When a worker thread encounters a join operation,
it
FJTaskRunnerGroup
class serves to construct worker
processes other tasks, if available, until the target task is
threads, maintains some shared state (for example, the identities
noticed to have completed (via isDone). All tasks
of all worker threads, needed for steal operations), and helps
otherwise run to completion without blocking.
coordinate startup and shutdown.
•
When a worker thread has no work and fails to steal any
More detailed implementation documentation is available inside
from others, it backs off (via yields, sleeps, and/or priority
the util.concurrent package. This section discusses only
adjustment − see section 3) and tries again later unless all
two
sets
of
problems
and
solutions
encountered
when
workers are known to be similarly idle, in which case they
implementing this framework: Supporting efficient deque
all block until another task is invoked from top−level.
operations (push, pop, and take), and managing the steal
protocol by which threads obtain new work.
3.1 Deques
To enable efficient and scalable execution, task management
Stealing
must be made as fast as possible. Creating, pushing, and later
popping (or, much less frequently, taking) tasks are analogs of
procedure call overhead in sequential programs. Lower overhead
enables programmers to adopt smaller task granularities, and in
turn better exploit parallelism.
Task allocation itself is the responsibility of the JVM. Java
garbage collection relieves us of needing to create a special−
purpose memory allocator to maintain tasks. This substantially
reduces the complexity and lines of code needed to implement
Pushing
FJTasks compared to similar frameworks in other languages.
The basic structure of the deque employs the common scheme
Deque
of using a single (although resizable) array per deque, along
Base
Top
with two indices: The top index acts just like an array−based
stack pointer, changing upon push and pop. The base index
is modified only by take. Since FJTaskRunner operations
are all intimately tied to the concrete details of the deque (for
example, fork simply invokes push), this data structure is
directly embedded in the class rather than being defined as a
Popping
separate component.
Because the deque array is accessed by multiple threads,
sometimes without full synchronization (see below), yet
individual
Java
array
elements
cannot
be
declared
as
volatile, each array element is actually a fixed reference to a
little forwarding object maintaining a single volatile
reference. This decision was made originally to ensure
As discussed in more detail in [5], the use of LIFO rules for
conformance with Java memory rules, but the level of
each thread processing its own tasks, but FIFO rules for stealing
indirection that it entails turns out to improve performance on
other tasks is optimal for a wide class of recursive fork/join
tested platforms, presumably by reducing cache contention due
designs. Less formally, the scheme offers two basic advantages:
to accesses of nearby elements, which are spread out a bit more
of a main task, upon its completion, and around global full−stop
in memory due to the indirection.
synchronization points employed in some fork/join algorithms.
The main challenges in deque implementation surround
The main issue here is what to do when a worker thread has no
synchronization and its avoidance. Even on JVMs with
local tasks and cannot steal one from any other thread. If the
optimized synchronization facilities[2], the need to obtain locks
program is running on a dedicated multiprocessor, then one
for every push and pop operation becomes a bottleneck.
could make the case for relying on hard busy−wait spins looping
However, adaptations of tactics taken in Cilk[5] provide a
to try to steal work. However, even here, attempted steals
solution based on the following observations:
increase contention, which can slow down even those threads
•
The
that are not idle (due to locking protocols in section 3.1).
push and pop operations are only invoked by owner
Additionally, in more typical usage contexts of this framework,
threads.
the operating system should somehow be convinced to try to run
•
Access to the take operation can easily be confined to one
other unrelated runnable processes or threads.
stealing thread at a time via an entry lock on take. (This
The tools for achieving this in Java are weak, have no
deque lock also serves to disable take operations when
guarantees (see [6, 7]), but usually appear to be acceptable in
necessary.) Thus, interference control is reduced to a two−
practice (as do similar techniques described for Hood[3]). A
party synchronization problem.
thread that fails to obtain work from any other thread lowers its
•
The pop and take operations can only interfere if the
priority
before
attempting
additional
steals,
performs
deque is about to become empty. Otherwise they are
Thread.yield between attempts, and registers itself as
guaranteed to operate on disjoint elements of the array.
inactive in its FJTaskRunnerGroup. If all others become
Defining the top and base indices as volatile ensures that
inactive, they all block waiting for additional main tasks.
a pop and take can proceed without locking if the deque is
Otherwise, after a given number of additional spins, threads
sure to have more than one element. This is done via a Dekker−
enter a sleeping phase, where they sleep (for up to 100ms) rather
like algorithm in which push pre−decrements top:
than yield between steal attempts. These imposed sleeps can
if (−−top >= base) ...
cause artificial lags in programs that take a long time to split
and take pre−increments base:
their tasks. But this appears to be the best general−purpose
if (++base < top) ...
compromise. Future versions of the framework may supply
In each case they must then check to see if this could have
additional control methods so that programmers can override
caused the deque to become empty by comparing the two
defaults when they impact performance.
indices. An asymmetric rule is used upon potential conflict: pop
4. PERFORMANCE
rechecks state and tries to continue after obtaining the deque
lock (the same one as held by take), backing off only if the
In these times of nearly continuous performance improvements
deque is truly empty. A take operation instead just backs off
of compilers and JVMs, performance measurements are only of
immediately, typically then trying to steal from a different
transient value. However, the metrics reported in this section
victim. This asymmetry represents the only significant departure
reveal some basic properties of the framework.
from the otherwise similar THE protocol used in Cilk.
A collection of seven fork/join test programs are briefly
The use of volatile indices also enables the push operation
described in the following table. These programs are adaptations
to proceed without synchronization unless the deque array is
of those available as demos in the util.concurrent
about to overflow, in which case it must first obtain the deque
package. They were selected to show some diversity in the kinds
lock to resize the array. Otherwise, simply ensuring that top is
of problems that can be run within this framework, as well as to
updated only after the deque array slot is filled in suppresses
obtain results for some common parallel test programs.
interference by any take.
Program Description
Subsequent to initial implementation, it was discovered that
Fib The Fibonnaci program shown in section 2,
several JVMs do not conform to the Java Memory Model [6]
run with argument 47 and granularity
rule requiring accurate reads after writes of pairs of volatile
threshold 13.
fields. As a workaround, the criterion for pop to retry under
lock was adjusted to trigger if there appear to be two or fewer
Integrate
Recursive
Gaussian
quadrature
of
elements, and the take operation added a secondary lock to
2⋅i 1 ⋅x 2⋅i 1 summing over odd
ensure a memory barrier. This suffices as long as at most one
index change is missed by the owner thread (which holds here
values of i from 1 to 5 and integrating from
for platforms that otherwise maintain proper memory order
−47 to 48.
when reading volatile fields), and causes only a tiny
Micro
Best−move finder for a board game, run
slowdown in performance.
with a lookahead of 4 moves.
3.2 Stealing and Idling
Sort
Merge/Quick sort (based on an algorithm
Worker threads in work−stealing frameworks know nothing
from Cilk) of 100 million numbers.
about the synchronization demands of the programs they are
MM
Multiply 2048 X 2048 matrices of doubles.
running. They simply generate, push, pop, take, manage the
status of, and execute tasks. The simplicity of this scheme leads
LU
Decompose 4096 X 4096 matrix of doubles.
to efficient execution when there is plenty of work for all
Jacobi
Iterative mesh relaxation with barriers: 100
threads. However, this streamlining comes at the price of relying
steps of nearest neighbor averaging on a
on heuristics when there is not enough work; i.e., during startup
4096 X 4096 matrix of doubles.
For the main tests, programs were run on a 30−CPU Sun
However, in general, the results show that increasing the number
Enterprise 10000 running the Solaris Production 1.2 JVM (an
of threads reliably increased the number of CPUs employed.
early version of the 1.2.2_05 release) on Solaris 7. The JVM
Speedups are reported as Timen / Time1. The best overall
was run with environment parameters selecting "bound threads"
speedup was seen for the integration program (speedup of 28.2
for thread mappings, and memory parameters discussed in
for 30 threads). The worst was for the LU decomposition
section 4.2. A few additional measurements reported below were
program (speedup of 15.35 for 30 threads).
run on a 4−CPU Sun Enterprise 450.
Another way of measuring scalability is in terms of task rates,
the average time taken to execute a single task (which may be
Times
either a recursive or a leaf step). The following figure shows
data from a single instrumented run capturing task rates. Ideally,
700
the numbers of tasks processed per unit time per thread should
be constant. The fact that they generally slightly decrease with
600
numbers of threads indicates some scalability limitations. Note
500
the fairly wide difference in task rates, reflecting differences in
task granularities. The smallest tasks sizes are seen in Fib,
400
which even with a threshold setting of 13, generated and
onds
executed a total of 2.8 million tasks per second when run with
c
e 300
30 threads.
S
200
Task rates
120000
100
0
100000
d
a
Fib
Micro Integ
MM
LU
Jaco
Sort
re
h
bi
80000
r
t
Programs were run with very large input parameters in order to
e
minimize timer granularity and JVM warm−up artifacts. Some
60000
c p
other warm−up effects were avoided by running an initial
se
problem set before starting timers. Most data are medians of
40000
three runs, but some (including most follow−up measurements
sks/
a
in sections 4.2−4.4) only reflect single runs so are a bit noisy.
T
20000
4.1 Speedups
Scalability measurements were obtained by running the same
0
problem set using worker thread groups of size 1 ... 30. There is
Fib
Micr Inte−
MM
LU
Jaco
Sort
o
grat
bi
Four factors appear to account for tail−offs in speedups in those
Speedups
programs that did not scale linearly. Three of them are common
30
to any parallel framework, but we’ll start with one unique to
FJTask (as opposed to Cilk etc), GC effects.
25
Ideal
4.2 Garbage Collection
20
Fib
In many ways, modern GC facilities are perfect matches to
Micro
fork/join frameworks: These programs can generate enormous
15
numbers of tasks, nearly all of which quickly turn into garbage
Integ
peedups
after they are executed. At any given time, deterministic
S 10
MM
fork/join programs require only at most p (where p is the
LU
number of threads) times the bookkeeping space that would be
5
Jacobi
required
for
sequential
versions
of
these
programs[10].
Generational semispace copying collectors (including the one in
Sort
0
the JVM used for these measurements − see [1]) cope with this
123456789111111111122222222223
well because they only traverse and copy non−garbage cells. By
012345678901234567890
doing so, they evade one of the trickiest problems in manual
parallel memory management, tracing memory that is allocated
Threads
by one thread but then used by another. The garbage collector is
no way to know whether the JVM always mapped each thread to
oblivious to the origin of memory, so need not deal with such
a different CPU when available. While there is no evidence
issues.
about this either way, it is possible that the lags to map new
As a simple indicator of the general superiority of generational
threads to CPUs increased with numbers of threads and/or
copying collection, a four−thread run of Fib that executes in 5.1
varied systematically across the different test programs.
seconds when using the memory settings used in the main
experiments reported here runs in 9.1 seconds if the generational
GC Effects: Fib
Memory bandwidth effects: Sorting
30
30
25
25
20
20
Ideal
Ideal
15
15
Bytes
Fib−64m
peedup
peedup
Shorts
S
Fib−4m
S
10
10
Ints
Fib−scaled
Longs
5
5
0
0
1 3 5 7 9 1 1 1 1 1 2 2 2 2 2
123456789111111111122222222223
1 3 5 7 9 1 3 5 7 9
012345678901234567890
Threads
Threads
copying phase is disabled on this JVM (in which case this JVM
4.4 Task Synchronization
relies entirely on mark−sweep).
As
discussed
in
section
3.2,
work−stealing
frameworks
However, these GC mechanisms turn into scaling problems
sometimes encounter problems dealing with frequent global
when memory is allocated at such a high rate that threads must
synchronization of tasks. The worker threads continue to poll for
be stopped nearly continuously to perform collections. The
more work even though there isn’t any, thus generating
following figure shows the difference in speedups across three
contention, and, in FJTask, sometimes even forcing threads
memory settings (this JVM supports optional arguments to set
into idle sleeps.
memory parameters): with the default 4 megabyte semispaces,
The Jacobi program illustrates some of the consequent issues.
with 64 megabyte semispaces, and with the amount of memory
This program performs 100 steps, where in each step, all cells
scaled to the number of threads (2 + 2p megabytes). With
are updated according to a simple nearest neighbor averaging
smaller semispaces, the overhead of stopping threads and
formula. A global (tree−based) barrier separates each step. To
collecting garbage starts to impact scaling as the garbage−
determine the magnitude of synchronization effects, a version of
generation rate climbs due to additional threads.
In light of these results, 64m semispaces were used for all other
Sync Effects: Jacobi
test runs. A better policy would have been to scale the amount
of memory to the number of threads in each test. (As seen in the
30
figure, this would have caused all speedups to appear slightly
more linear.). Alternatively, or in addition, program−specific
25
task granularity thresholds could be increased proportionally
with numbers of threads.
20
Ideal
4.3 Memory Locality and Bandwidth
15
1step/sync
Four of the test programs create and operate on very large
peedup
S
10steps/−
shared arrays or matrices: Sorting numbers, and multiplying,
10
sync
decomposing or performing relaxation on matrices. Of these,
sorting is probably the most sensitive to the consequences of
5
having to move data around processors, and thus to the
aggregate memory bandwidth of the system as a whole. To help
0
determine the nature of these effects, the Sort program was
1 3 5 7 9 1 1 1 1 1 2 2 2 2 2
recast into four versions, that sorted bytes, shorts, ints, and
1 3 5 7 9 1 3 5 7 9
longs respectively. Each version sorted data only in the range of
0...255, to ensure that they were otherwise equivalent. The wider
Threads
the data elements, the more memory traffic.
this program was written to synchronize only after every 10
The results show that increased memory traffic leads to poorer
steps. The scaling differences show the impact of the current
speedups, although they do not provide definitive evidence that
policy, and indicate the need to include additional methods in
this is the only cause of tail−offs.
future versions of this framework to allow programmers to
override default parameters and policies. (Note however that
Element width also impacts absolute performance. For example,
this graph probably slightly exaggerates pure synchronization
using only one thread, sorting bytes took 122.5 seconds, while
effects, since the 10−step version is also likely to maintain
sorting longs took 242.4 seconds.
somewhat greater task locality.)
4.5 Task Locality
Other Frameworks
FJTask, like other fork/join frameworks, is optimized for the
case where worker threads locally consume the vast majority of
8
the tasks they create. When this does not occur, performance can
suffer, for two reasons:
7
•
Stealing tasks encounters much more overhead than does
6
popping tasks.
FJTask
•
Cilk
In most programs in which tasks operate on shared data,
5
running your own subdivided task is likely to maintain better
Hood
onds
locality of data access.
4
c
Stack−
e
threads
S
Locality effects
3
Filaments
0.225
2
0.2
1
0.175
Fib
t
olen
0
0.15
Micro
Fib
MM Sort
LU
Intg
Jac
0.125
Integrate
different choices among these than used here could produce
t
ion s
0.1
MM
nearly
any
relative
performance
ranking
among
these
0.075
LU
frameworks in many high−performance applications.
r
opor
0.05
P
Jacobi
FJTask generally performs worse (on this JVM) on programs
0.025
that mainly do floating point computations on arrays and
Sort
matrices. Even though JVMs are improving, they are still not
0
always competitive with powerful back−end optimizers used in
1 3 5 7 9 1 1 1 1 1 2 2 2 2 2
C and C++ programs. Although not shown in this chart,
1 3 5 7 9 1 3 5 7 9
FJTask versions of all programs ran faster than versions of
programs in these other frameworks when they were compiled
Threads
without optimization enabled, and some informal tests suggest
As seen in the figure, in most programs, the relative number of
that most of the remaining differences are due to array bounds
stolen tasks is at most a few percent. However, the LU and MM
checks and related runtime obligations. These are, of course,
programs generate larger imbalances in workloads (and thus
issues and problems that are attracting much attention and effort
relatively more stealing) as the number of threads increase. It is
by JVM and compiler developers. The observed differences in
possible that some algorithmic tuning of these programs could
code quality for computationally intensive programs are likely
reduce this effect, and in turn lead to better speedups.
to diminish.
4.6 Comparisons to Other Frameworks
5. CONCLUSIONS
It is impossible to perform definitive or even very meaningful
This paper has demonstrated that it is possible to support
measurements
comparing programs
across languages
and
portable, efficient, scalable parallel processing in pure Java, and
frameworks. However, relative measurements can at least show
to provide a convenient API for programmers who can exploit
some of the basic advantages and limitations of the FJTask
the framework by following only a few simple design rules and
framework versus similar frameworks written in other languages
design patterns (as presented and discussed in [7]). The
(here, C and C++). The following chart compares performance
observed performance characteristics of the sample programs
of programs that were either based on or are similar to those
measured here both provide further guidance for users of the
supplied with the Cilk, Hood, Stackthreads, and/or Filaments
framework, and suggest some areas of potential improvement to
distributions. All of these were run with 4 threads on a 4−CPU
the framework itself.
Enterprise 450. To avoid reconfiguring other frameworks or
Even though scalability results were shown only for a single
their test programs, all tests were run with smaller problems sets
JVM, some of the main empirical findings should hold more
than used above. All results represent the best of three runs,
generally:
using the compiler and run−time settings that appeared to
provide the fastest times. Fib was run without any granularity
•
While
generational
GC
generally
meshes
well
with
threshold; i.e., an implicit threshold of 1. (The "prune" setting in
parallelism,
it
can
hinder
scalability
when
garbage
Filaments Fib was set to 1024, which causes it to behave more
generation rates force very frequent collections. On this
similarly to the other versions.)
JVM, the underlying cause appears to be that stopping
Speedups going from one to four threads were very similar
threads for collection takes time approximately proportional
(between 3.0 and 4.0) across the different frameworks and test
to the number of running threads. Because more running
programs, so the accompanying chart focuses on differences in
threads generate more garbage per unit time, overhead can
absolute performance. However, because the multithreading
climb approximately quadratically with numbers of threads.
aspects of all these frameworks are all fast, most of these
Even so, this significantly impacts performance only when
differences reflect differences in application−specific code
the GC rate is relatively high to begin with. However, the
quality
stemming
from
different
compilers,
optimization
resulting issues invite further research and development of
switches, and configuration parameters. In fact, it is likely that
concurrent and parallel GC algorithms. The results presented
here additionally demonstrate the desirability of providing
7. REFERENCES
tuning options and adaptive mechanisms on multiprocessor
[1] Agesen, Ole, David Detlefs, and J. Eliot B. Moss. Garbage
JVMs to scale memory to the number of active processors.
Collection
and
Local
Variable
Type−Precision
and
•
Most scalability issues reveal themselves only when
Liveness in Java Virtual Machines. In Proceedings of 1998
programs are run using more CPUs than are even available
ACM SIGPLAN Conference on Programming Language
on most stock multiprocessors. FJTask (as well as other
Design and Implementation (PLDI), 1998.
fork/join frameworks) appear to provide nearly ideal
[2] Agesen, Ole, David Detlefs, Alex Garthwaite, Ross
speedups for nearly any fork/join program on commonly
available 2−way, 4−way, and 8−way SMP machines. The
Knippel, Y.S. Ramakrishna, and Derek White. An Efficient
present paper appears to be the first reporting systematic
Meta−lock for Implementing Ubiquitous Synchronization.
results for any fork/join framework designed for stock
In Proceedings of OOPSLA ’99, ACM, 1999.
multiprocessors run on more than 16 CPUs. Further
[3] Arora, Nimar, Robert D. Blumofe, and C. Greg Plaxton.
measurements are needed to see if the patterns of results
Thread Scheduling for Multiprogrammed Multiprocessors.
seen here hold in other frameworks as well.
In Proceedings of the Tenth Annual ACM Symposium on
•
Characteristics of application programs (including memory
Parallel Algorithms and Architectures (SPAA), Puerto
locality, task locality, use of global synchronization) often
Vallarta, Mexico, June 28 − July 2, 1998.
have more bearing on both scalability and absolute
[4] Blumofe, Robert D. and Dionisios Papadopoulos. Hood: A
performance than do characteristics of the framework, JVM
User−Level
Threads
Library
for
Multiprogrammed
or underlying OS. For example, informal tests showed that
Multiprocessors. Technical Report, University of Texas at
the
careful
avoidance
of
synchronization
in
deques
Austin, 1999.
(discussed in section 3.1) has essentially no impact on
programs with relatively low task generation rates such as
[5] Frigo, Matteo, Charles Leiserson, and Keith Randall. The
LU. However, the focus on keeping task management
Implementation of the Cilk−5 Multithreaded Language. In
overhead to a minimum widens the range of applicability
Proceedings of 1998 ACM SIGPLAN Conference on
and utility of the framework and the associated design and
Programming
Language
Design
and
Implementation
programming techniques.
(PLDI), 1998.
In addition to incremental improvements, future work on this
[6] Gosling, James, Bill Joy, and Guy Steele. The Java
framework may include construction of useful applications (as
Language Specification, Addison−Wesley, 1996.
opposed to demos and tests), subsequent evaluations under
[7] Lea, Doug. Concurrent Programming in Java, second
production program loads, measurements on different JVMs,
edition, Addison−Wesley, 1999.
and development of extensions geared for use with clusters of
multiprocessors.
[8] Lowenthal, David K., Vincent W. Freeh, and Gregory R.
Andrews. Efficient Fine−Grain Parallelism on Shared−
6. ACKNOWLEDGMENTS
Memory Machines. Concurrency−Practice and Experience,
This work was supported in part by a collaborative research
10,3:157−173, 1998.
grant from Sun Labs. Thanks to Ole Agesen, Dave Detlefs,
[9] Simpson, David, and F. Warren Burton. Space efficient
Christine Flood, Alex Garthwaite, and Steve Heller of the Sun
execution
of
deterministic
parallel
programs.
IEEE
Labs Java Topics Group for advice, help, and comments. David
Transactions on Software Engineering, December, 1999.
Holmes, Ole Agesen, Keith Randall, Kenjiro Taura, and the
anonymous referees provided useful comments on drafts of this
[10] Taura, Kenjiro, Kunio Tabata, and Akinori Yonezawa.
paper. Bill Pugh pointed out the read−after−write limitations of
"Stackthreads/MP:
Integrating
Futures
into
Calling
JVMs discussed in section 3.1. Very special thanks to Dave Dice
Standards." In Proceedings of ACM SIGPLAN Symposium
for reserving time and performing test runs on the 30−way
on Principles & Practice of Parallel Programming
Enterprise.
(PPoPP), 1999.