Original PDF Flash format High-level-Real-time-Programming-in-Java  


High Level Real Time Programming In Java

High-level Real-time Programming in Java
David F. Bacon
Perry Cheng
David Grove
Michael Hind
V.T. Rajan
Eran Yahav
IBM T.J. Watson Research Center
Matthias Hauswirth
Christoph M. Kirsch
Daniel Spoonhower
Martin T. Vechev
Universit `a della Svizzera
Universit ¨at Salzburg
Carnegie Mellon University
University of Cambridge
ABSTRACT
1.
INTRODUCTION
Real-time systems have reached a level of complexity beyond
Real-time systems are rapidly becoming both more com-
the scaling capability of the low-level or restricted languages
plex and more pervasive.
traditionally used for real-time programming.
Traditional real-time programming methodologies have re-
While Metronome garbage collection has made it practical
volved around relatively simple systems. This has meant
to use Java to implement real-time systems, many challenges
that it was possible to use restrictive programming method-
remain for the construction of complex real-time systems,
ologies with deterministic, statically verifiable properties or
some specific to the use of Java and others simply due to
very low-level programming techniques amenable to cycle-
the change in scale of such systems.
accurate timing analysis [18].
The goal of our research is the creation of a comprehensive
However, those techniques do not scale to the large, com-
Java-based programming environment and methodology for
plex systems that are now beginning to be built. The lack of
the creation of complex real-time systems. Our goals include
scaling is manifested at both the theoretical and the practi-
construction of a provably correct real-time garbage collec-
cal level. Basic principles of undecidability mean that as the
tor capable of providing worst case latencies of 100 µs, capa-
software increases in size, the required expressiveness yields
ble of scaling from sensor nodes up to large multiprocessors;
a system that can not be statically verified. Basic principles
specialized programming constructs that retain the safety
of software engineering mean that low-level programming is
and simplicity of Java, and yet provide sub-microsecond la-
untenable at the resulting scale.
tencies; the extension of Java’s “write once, run anywhere”
As a result there is broad interest in using Java for a wide
principle from functional correctness to timing behavior; on-
variety of both soft- and hard-real-time systems. Java pro-
line analysis and visualization that aids in the understanding
vides two main advantages: a scalable, safe, high-level pro-
of complex behaviors; and a principled probabilistic analy-
gramming model, and a huge body of software components
sis methodology for bounding the behavior of the resulting
that can be used to compose the soft- and non-real-time
systems.
portions of the system. Being able to use a single language
While much remains to be done, this paper describes the
across all domains of the system provides enormous benefits
progress we have made towards these goals.
in simplicity, re-usability, and staffing and training require-
ments.
Categories and Subject Descriptors:
C.3 [Special-
Java’s high level of safety and security comes from a com-
Purpose and Application-Based Systems]: Real-time and
bination of both static and dynamic checking. Although
embedded systems; D.3.2 [Programming Languages]: Java;
dynamic checks typically reduce execution-time determin-
D.3.3 [Programming Languages]: Language Constructs and
ism, in a large-scale mission- or safety-critical system the
Features—Dynamic storage management; D.3.4 [Program-
reliability they provide is essential.
ming Languages]: Processors—Memory management (gar-
The goal of the Metronome project at IBM Research is to
bage collection) D.4.7 [Operating Systems]: Organization
make Java suitable for programming real-time systems. Our
and Design—Real-time systems and embedded systems
goal encompasses both hard- and soft-real-time systems, at
General Terms: Experimentation, Languages, Measure-
time scales as low as those achievable by any software sys-
ment, Performance
tem.
The largest potential source of non-determinism in Java
Keywords: Scheduling, Allocation, WCET, Tasks, Visual-
is garbage collection. In previous work we addressed this
ization
issue with the development of a true real-time garbage col-
lector which is capable of providing latency, utilization, and
throughput guarantees that make it suitable for program-
Permission to make digital or hard copies of all or part of this work for
ming systems with periods as low as 5 milliseconds [2].
personal or classroom use is granted without fee provided that copies are
This technology forms the basis of a new real-time Java
not made or distributed for profit or commercial advantage and that copies
virtual machine product being developed by IBM, and is un-
bear this notice and the full citation on the first page. To copy otherwise, to
der evaluation by various customers in the defense, telecom-
republish, to post on servers or to redistribute to lists, requires prior specific
munications, and embedded systems businesses.
permission and/or a fee.
However, significant challenges remain in real-time garbage
EMSOFT’05, September 19–22, 2005, Jersey City, New Jersey, USA.
Copyright 2005 ACM 1-59593-091-4/05/0009 ...
collection, in real-time Java, and in complex real-time sys-
$5.00.

3000
1
2500
0.8
2000
0.6
1500
Count
Utilization
0.4
1000
500
0.2
0
0
0
0.5
1
1.5
2
2.5
0
1
2
3
4
5
6
7
8
9
Pause Time (ms)
10
Time (cycles)
x 10
Figure 1: Pause time distributions for javac in the
Figure 2:
CPU utilization for javac under the
J9 implementation of Metronome, with target max-
Metronome. Mutator interval is 7 ms, collector in-
imum pause time of 3 ms and a “beat” size of 500 µs.
terval is 3 ms, for an overall utilization target of
The actual maximum pause is 2.4 ms.
70%; the collector achieves this within 0.8% jitter.
tems in general. This paper describes our ongoing research
pages, and each page is divided into blocks of a partic-
across these various problem domains.
ular size. Objects are allocated from the smallest size
class that can contain the object.
2.
THE METRONOME COLLECTOR
Mostly Non-copying. Since fragmentation is rare, objects
In this section we describe our previous work on the Metro-
are usually not moved.
nome, which forms the heart of the real-time Java system
we are developing [2, 1]. Metronome was originally imple-
Defragmentation. If a page becomes fragmented due to
mented in Jikes RVM [16]; a second generation implemen-
garbage collection, its objects are moved to another
tation of the Metronome is underway in IBM’s J9 virtual
(mostly full) page.
machine. We describe the algorithm and engineering of the
collector in sufficient detail to serve as a basis for under-
Read Barrier. Relocation of objects is achieved by using
standing the work described in the rest of this paper.
a forwarding pointer located in the header of each ob-
The Metronome is an incremental collector targeted at
ject [8]. A read barrier maintains a to-space invariant
embedded systems. It uses a hybrid approach of non-copying
(mutators always see objects in the to-space).
mark-sweep (in the common case) and copying collection
Incremental Mark-Sweep. Collection is a standard in-
(when fragmentation occurs).
cremental mark-sweep similar to Yuasa’s snapshot-at-
The original algorithm was designed for uniprocessors.
the-beginning algorithm [27] implemented with a weak
Recently this restriction has been removed and the system
tricolor invariant. We extend traversal during marking
has been applied in server-class systems, as described in Sec-
so that it redirects any pointers pointing at from-space
tion 3.1.
so they point at to-space. Therefore, at the end of a
Metronome uses a snapshot-at-the-beginning algorithm
marking phase, the relocated objects of the previous
that allocates objects black (marked). Although it has been
collection can be freed.
argued that such a collector can increase floating garbage,
the worst-case performance is no different from other ap-
Arraylets. Large arrays are broken into fixed-size pieces
proaches and the termination condition is easier to enforce.
(which we call arraylets) to bound the work of scan-
Other real-time collectors have used a similar approach.
ning or copying an array and to bound external frag-
Figures 1 and 2 show the real-time performance of our col-
mentation caused by large objects.
lector. Pause times are centered around the “beat” which is
the nominal frequency of the underlying scheduler (500µs);
Since our collector is not concurrent, we explicitly control
worst-case latencies are well below the target. Utilization
the interleaving of the mutator and the collector. We use
is high (above the 70%) with minimal jitter. Utilizations of
the term collection to refer to a complete mark/sweep/ de-
100% occur while collection is off. In this section we explain
fragment cycle and the term collector quantum to refer to a
how the Metronome achieves these goals.
scheduler quantum in which the collector runs.
2.1
Features of the Metronome Collector
2.2
Read Barrier
Our collector is based on the following principles:
We use a Brooks-style read barrier [8]: each object con-
tains a forwarding pointer that normally points to itself, but
Segregated Free Lists. Allocation is performed using seg-
when the object has been moved, points to the moved ob-
regated free lists. Memory is divided into fixed-sized
ject.

utilization.
Application
task period = ∆t
The reason for this is simple: work-based algorithms do
Scheduler
m = maximum live memory
a little bit of collection work each time the mutator allo-
a(G) = allocation rate
cates memory. The idea is that by keeping this interruption
memory size = s
short, the work of collection will naturally be spread evenly
utilization = u
throughout the application. Unfortunately, programs are
Collector
not uniform in their allocation behavior over short time
- or -
R = collection rate
utilization =
scales; rather, they are bursty.
As a result, work-based
u
ρ = fragmentation factor
memory size =
strategies suffer from very poor mutator utilization during
s
such bursts of allocation.
In fact, we showed both analytically and experimentally
that work-based collectors are subject to these problems and
Figure 3: Interaction of components in a Metro-
that utilization often drops to 0 at real-time intervals.
nomic virtual machine. Parameters of the applica-
Time-based scheduling simply interleaves the collector and
tion and collector are intrinsic; parameters to the
the mutator on a fixed schedule. While there has been con-
scheduler are user-selected, and are mutually deter-
cern that time-based systems may be subject to space explo-
minant.
sion, we have shown that in fact they are quite stable, and
only require a small number of coarse parameters that de-
scribe the application’s memory characteristics to function
Our collector thus maintains a to-space invariant: the mu-
within well-controlled space bounds.
tator always sees the new version of an object. However, the
2.4
Provable Real-time Bounds
sets comprising from-space and to-space have a large inter-
section, rather than being completely disjoint as in a pure
Our collector achieves guaranteed performance provided
copying collector.
the application is correctly characterized by the user. In
Although we use a read barrier and a to-space invariant,
particular, the user must specify the maximum amount of
our collector does not suffer from variations in mutator uti-
simultaneously live data m as well as the peak allocation
lization because all of the work of finding and moving objects
rate over the time interval of a garbage collection a(∆GC).
is performed by the collector.
The collector is parameterized by its tracing rate R.
Read barriers, especially when implemented in software,
Given these characteristics of the mutator and the collec-
are frequently avoided because they are considered to be
tor, the user then has the ability to tune the performance of
too costly. We have shown that an efficient read barrier im-
the system using three inter-related parameters: total mem-
plementation can be obtained using an optimizing compiler
ory consumption s, minimum guaranteed CPU utilization u,
that is able to optimize the barriers.
and the resolution at which the utilization is calculated ∆t.
We apply a number of optimizations to reduce the cost of
The relationship between these parameters is shown graph-
read barriers, including well-known optimizations like com-
ically in Figure 3. The mutator is characterized by its allo-
mon subexpression elimination and special-purpose optimi-
cation rate over the interval of a garbage collection a(∆GC)
zations like barrier-sinking, in which the barrier is sunk
and by its maximum memory requirement m. The collector
down to its point of use, which allows the null-check re-
is characterized by its collection rate R and a pre-selected
quired by the Java object dereference to be folded into the
fragmentation limit ρ (typically 1/8 worst-case fragmenta-
null-check required by the barrier (since the pointer can be
tion overhead is tolerated). The tunable parameters are ∆t,
null, the barrier cannot perform the forwarding uncondition-
the frequency at which the collector is scheduled, and either
ally).
the CPU utilization level of the application u (in which case
This optimization works with any null-checking approach
a memory size s is determined), or a memory size s which
used by the run-time system, whether via explicit compar-
determines the utilization level u.
isons or implicit traps on null dereferences. The important
In either case both space and time bounds are guaran-
point is that we usually avoid introducing extra explicit
teed.
checks for null, and we guarantee that any exception due
to a null pointer occurs at the same place as it would have
3.
BETTER REAL-TIME COLLECTION
in the original program.
The Metronome system just described provides worst-case
In the Jikes RVM implementation of Metronome, these
latencies of 3 milliseconds, suitable for the majority of real-
optimizations resulted in fairly low read barrier overheads.
time systems. However, there is still room for improvement
On the SPECjvm98 benchmarks, the mean read barrier over-
and areas in which the technology needs to be extended.
head is 4%, or 9.6% in the worst case (in the 201.compress
benchmark). Read barriers are not yet fully operational in
3.1
Multiprocessor Real-time Collection
our J9 implementation, but we will apply similar optimiza-
Large systems which track many concurrent events are of-
tions and expect to achieve similar results.
ten implemented on multiprocessors to achieve the desired
level of performance and scale. The original Metronome al-
2.3
Time-Based Scheduling
gorithm was limited to uniprocessors, making it unavailable
Our collector can use either time- or work-based schedul-
to this significant application domain.
ing. Most previous work on real-time garbage collection,
The fundamental problem is that the Metronome collector
starting with Baker’s algorithm [3], has used work-based
relies on being able to make atomic (but small and bounded)
scheduling. Work-based algorithms may achieve short in-
changes to the heap during its execution quanta. This is
dividual pause times, but are unable to achieve consistent
accomplished by the use of safe points, which are inserted

by the compiler into the application code at points where
phases of collection. In particular, the tracing phase, which
certain invariants hold. A context switch into the collector
is most subject to the fundamental load balancing prob-
may only take place when all threads are at safe points.
lems described above, is amenable to execution in parallel
Safe points are essentially a low-overhead amortized syn-
with the mutators. The phases which are not amenable to
chronization technique. They allow the compiled code to
parallelization with the mutators are stack scanning and de-
perform heap operations without the use of expensive atomic
fragmentation.
operations or locking.
The implementation of Metronome in IBM’s J9 virtual
There are several kinds of synchronization between the
machine uses the stop-the-worldlet approach and is in use
mutators and the collector. When the mutator modifies the
now by customers, who are achieving good results on 2- to
heap, it must inform the collector so that its tracing of the
4-way multiprocessors.
heap does not miss objects (via the write barrier). Similarly,
In the long term, we seek to develop truly scalable algo-
when the collector moves an object (to reduce fragmentation
rithms. We have begun work on an almost wait-free algo-
and bound space consumption), it must inform the mutator
rithm, called Staccato, which will not only greatly increase
so that it sees the new version of the object (via the read
the scalability of the collector but also allow further reduc-
barrier).
tion in latency: the ability of mutators to run in parallel
To extend the Metronome algorithm to multiprocessors,
with any phase of the collector essentially allows extremely
there are basically two options: perform the work quanta
fast preemption.
synchronously across all processors, or design the algorithm
so that the quanta can proceed concurrently with the muta-
3.2
Priorities, Latency, and Jitter
tors. Our current implementation uses the former approach,
An issue that is often confusing to users is the question
we call stop the worldlet.
of the priority of the garbage collector. They want to set
Stop the worldlet has the advantage of (relative) simplicity
the priority of their real-time threads higher than that of
in that basic atomicity constraints are still enforced. How-
the collector, so that they get serviced quickly even when
ever, it is fundamentally limited in its ability to scale up
collection is in progress. However, if this were done with op-
to large numbers of processors, and in its ability to scale
erating system priorities that really were higher than those
down pause times. This is due to the costs of barrier syn-
of the collector, then the collector could be interrupted while
chronization, unevenness in the work estimators, and other
the heap was in an indeterminate state.
fine-grained load balancing issues. However, with careful
As a result, the collector runs at a higher “physical” pri-
engineeraing we are able to achieve very good results on
ority (in the operating system) than Java threads that have
multiprocessors of modest scale (so far, up to 8-way ma-
access to the garbage collected heap. However, users may
chines).
set threads to run at a higher “logical” priority than the
In addition to basic synchronization problems, multipro-
collector, in which case the collector will voluntarily give
cessing also introduces some new issues specific to the real-
up control as soon as it detects that a logical high-priority
time domain. In particular, the real-time guarantee must
thread is ready to run.
take into account the load balancing behavior across pro-
For periodic threads and timers, since the collector knows
cessors, since the collection does not complete until the last
the deadline in advance, it can adjust its work quantum in
processor is finished.
advance so that it finishes in time for the deadline. There
At a theoretical level, this requires the programmer to
is a small amount of jitter in the work estimator, but by
bound another parameter, namely the longest chain of ob-
factoring that jitter into the deadline we are able to meet
jects uniquely reachable from the roots [6]. The tracing of
time-based deadlines with a jitter of ±3µs. The cost is that
such a chain can not be parallelized, and therefore we must
we must spin during the over-provisioning period for the
assume that in the worst case one processor begins process-
work estimator, which is typically 5µs. Although there are
ing the chain just as all of the others finish tracing the rest
occasional longer latencies, they are predictable and the col-
of the heap. In particular, this parameter determines the
lector can schedule around them.
worst-case load balance.
However, the frequency of periodic threads and the la-
The actual quality of load balancing is also determined
tency for event-driven threads is still limited by the inherent
by constant overheads and the trade-off between granularity
latency of the collector.
and synchronization overhead; once again, good engineering
can help stave off the inevitable. In practice, both load
3.3
Latency Reduction
balancing problems become progressively more significant
The lower limit at which real-time garbage collection can
as the system is scaled up.
be applied is determined by the worst-case latency of the
The extra time spent in load balancing does not adversely
system. This is currently about 2 milliseconds, making the
affect latency. It has two significant effects: the first is the
system suitable for periodic tasks down to about 200 Hertz.
delay in completion of a collection, which translates into a
While adequate for a large body of real-time systems, there
higher memory requirement (since the program will continue
are still many systems with lower latency requirements.
to allocate while it waits for the collector to finish). In
The worst-case latency is determined by the largest atomic
practice, this does not appear to be a major problem on
step that the system needs to take. In a garbage collector,
memory-rich multiprocessors.
those steps typically consist of things like scanning the stacks
The second significant effect is on utilization, since all pro-
for roots, scanning the global variables for roots, scanning
cessors are stopped synchronously. However, in some cases
the pointers of an object, moving an object, and so on.
the work of various phases of collection can be overlapped,
In the Metronome, the use of arraylets ensures that both
which reduces this effect. Further improvement can be ob-
object scanning and object relocation are bounded and short
tained by allowing mutators to run concurrently with some
operations. The use of a read barrier and mark-phase pointer

fixup avoids the need for atomically updating all of the
live memory m. An accurate bound would essentially have
pointers to a moved object.
Global variables are write-
to perform an abstract interpretation of the collector at
barriered, so they need not be scanned for root pointers
compile-time, which in general will be unable to provide
(the write barrier records them incrementally).
useful bounds for the kinds of complex programs of interest.
However, in our current implementation, two significant
Currently, for unrestricted programs, we do not see any al-
atomic steps remain: stack processing and finalizer process-
ternative, except empirical methods based on test coverage.
ing. To achieve our current bounds, stack size and finalizer
However, it is important to distinguish the problems in-
usage is limited. However, these are not fundamental prob-
troduced by garbage collection per se from those introduced
lems and we are working to incrementalize them.
by the use of dynamic data structures which are inherent
Stack processing can be incrementalized by the use of
to complex real-time systems. The maximum live memory
stacklets, which partitions the stack into fixed-size chunks
must also be computed in a system built with explicit allo-
which are then processed atomically [9]. This requires a
cation and de-allocation, or in a system using object pooling
modification to the call and return sequences so that stack
from a fixed-size pool, or in a system using scoped memory
overflow on call causes the insertion of a “return barrier”,
in all but the most trivial ways.
which snapshots the pointers of the stacklet below it before
Fundamentally, the complexity of verification comes from
returning to the calling function.
the use of dynamic data structures. The additional com-
With stacklets, the atomic root processing of the collector
plexity introduced by garbage collection is that the alloca-
is limited to scanning the top stacklets of the currently run-
tion rate must also be considered. However, this must be
ning threads. We expect this to take in the neighborhood
balanced against the reduction in complexity and the im-
of 50 microseconds on current hardware.
provement in reliability provided by automatic garbage col-
The other problem is finalizer processing, or more pre-
lection.
cisely the processing of all of Java’s “strange” pointers, which
include not only unreachable finalized objects, but also weak,
3.5
Verifying the Collector
soft, and phantom references. This is not inherently difficult
The other aspect of correctness is that of the collector
to incrementalize, but requires that all java.lang.ref types are
implementation itself. A real-time concurrent garbage col-
implemented with a double indirection so that the require-
lector is a very complex subsystem of the virtual machine,
ment for atomic clearing can be met in unit time.
and the algorithms involved are notoriously difficult to prove
More problematic, however, is how to implement soft ref-
correct. Since the collector now forms part of the trusted
erences, whose semantics almost require a stop-the-world
computing base of a critical system, verification becomes
garbage collection: “ All soft references to softly-reachable
increasingly important.
objects are guaranteed to have been cleared before the vir-
The study of concurrent garbage collectors began with
tual machine throws an OutOfMemoryError”. If this is done
Steele [23], Dijkstra [11], and Lamport [17]. Concurrent
when memory is truly full, then by the time the collector
collectors were immediately recognized as paradigmatic ex-
makes its last-ditch attempt to reclaim memory, the real-
amples of the difficulty of constructing correct concurrent
time characteristics will have already been violated. The
algorithms. Steele’s algorithm contained an error which he
implementation is free to clear soft references at any time,
subsequently corrected [24], and Dijkstra’s algorithm con-
and in our initial implementation we simply clear them on
tained an error discovered and corrected by Stenning and
each collection.
Woodger [11]. Furthermore, some correct algorithms [4] had
However, a solution which retains the useful properties
informal proofs that were found to contain errors [20].
of soft references is desirable, but it is not clear how to
Many additional incremental and concurrent algorithms
reconcile this with real-time collection. In particular, since
have been introduced over the last 30 years, but there has
we require an upper bound m on the size of the live data in
been very little experimental comparison of the algorithms
the heap, and soft references are explicitly designed to allow
and no formal study of their relative merits. While there
that quantity to be indeterminate. At present we have no
is now a well-established “bag of tricks” for concurrent col-
solution to this problem.
lectors, each algorithm composes them differently based on
the intuition and experience of the designer. Since each al-
3.4
Verifying Application Parameters
gorithm is different, a correctness proof for one algorithm
Metronome’s ability to provide real-time guarantees is
cannot be re-used for others.
contingent on accurate characterization of the maximum live
Previous work on proving the correctness of concurrent
memory m and the maximum long-term allocation rate a.
collectors has applied the proof to the algorithm itself. Since
Clearly the ability to accurately characterize those quanti-
the algorithm is complicated, the proof is as well, and there-
ties is essential to the correctness of the resulting system.
fore subject to error.
Allocation rate is the easier of the two to bound. In fact,
We have taken a different approach: rather than proving
bounding the allocation rate is simpler than computing the
the ultimate algorithm correct, we start with an extremely
WCET of a task, since as Mann et al [19] have shown, it can
simple algorithm which is amenable to a simple proof, and
usually be performed using an analysis that follows worst-
then transform it into a practical, efficient algorithm with
case paths without knowing which ones are taken or how
a series of incremental transformations, each of which can
often. They achieved static bounds that were usually within
also be proved correct [25, 26].
a factor of two of the actual measured allocation rate.
In the process of developing these transformations, we
In some cases it may be necessary to provide loop bounds
have generalized various aspects of concurrent collection, in-
to obtain a sufficiently tight bound on the allocation rate,
cluding the treatment of write barriers as reference count-
but this is also true of WCET analysis.
ing operations, mixing incremental-update and snapshot ap-
The more difficult problem is computing the maximum
proaches in a single collector, and providing a range of tech-

4.1
Handlers
Handlers are designed for tasks that require extremely
low latencies and very high frequencies. They are based
on the principle that as the frequency with which tasks are
Handler
Handler
executed, their complexity of necessity drops.
Object
Handlers are tasks that operate on a pre-allocated data
structure whose pointers are immutable. The run-time sys-
tem pins this data structure so that the garbage collector
Buffer 1
Buffer 2
cannot move it, as shown in Figure 4.
Because Handlers cannot change the shape of the heap,
STACK
HEAP
GLOBAL
and the garbage collector cannot change the location of the
Handler’s objects, Handlers can preempt the garbage col-
Figure 4: Interaction of Handlers with the heap.
lector at any time, even while the invariants of the rest of
Handlers reside in the garbage collected heap, but
the heap are temporarily broken. This allows extremely fast
are pinned (gray objects) during their lifetime.
context switching, limited only by the underlying hardware
They may be referenced by other heap objects, and
and operating system.
will be subject to garbage collection once the han-
Rather than rely on dynamic checks, Handlers make use of
dler exits.
Java’s existing final mechanism, which is statically verified.
However, Handlers have some additional restrictions that
are checked at instantiation time, when the Handler is first
niques for handling newly allocated objects.
loaded and before it is scheduled. A Handler contains a run()
This has resulted not only in new insights, but also in the
method and a set of local variables.
derivation of new algorithms, which have been shown both
At instantiation time the validator checks that the data
empirically [25] and formally [26] to be more precise than
structure reachable from the local variables does not contain
some previous algorithms. In particular, the new algorithm
any non-final pointers, and that code reachable from the
retains the predictable termination property of snapshot al-
run() method does not access any non-final global pointers,
gorithms (which are necessary to achieve real-time guaran-
manipulate threads, or perform any blocking synchroniza-
tees) with the lower memory requirements of incremental-
tion operations.
update algorithms.
If the Handler is valid, then the data structure reachable
We have formalized for the first time the notion of the rela-
from its local variables is pinned: the garbage collector is
tive precision of concurrent collectors, and express the trans-
informed that the objects are temporarily unmovable.
formations as correctness-preserving and precision-reducing.
The Handler is now guaranteed to access only final point-
We observe that many precision-reducing transformations
ers of pinned objects.
Because the pointers are final, it
are also concurrency-increasing, and are currently working
will execute no write barriers and there is no mutator-to-
to formalize the notion of relative concurrency of collectors.
collector synchronization; because the objects are pinned,
Ultimately, our goal is to produce the actual code of the
the collector will not move them and there is no collector-
collector via mechanical transformation from the simple,
to-mutator synchronization.
provable algorithm and a selection of provable transforma-
Handlers are well-suited to tasks that perform buffer pro-
tions chosen to provide the desired performance properties.
cessing. For instance, it may be necessary to sample a sensor
at very high frequency. The sample data can then be pro-
4.
SPECIALIZED CONSTRUCTS
cessed by a lower-frequency task that analyzes the sample,
Although real-time garbage collection is able to provide
perhaps performing convolutions and then choosing an actu-
real-time behavior to extremely general code at a fairly high
ator value. High-frequency buffered output can also be used
resolution, as discussed in Section 3.3 there appear to be
to drive devices such as software sound generators, where
fundamental limits on how low that latency can be driven.
the lower-frequency task can create a waveform and then
Furthermore, in some situations, it is desirable to use a
pass it in a buffer to a Handler.
more restrictive programming model to enforce certain re-
Figure 4 shows a canonical Handler that uses double-
source constraints or to improve the level of static verifia-
buffering. The buffers are exchanged between the Handler
bility of the system.
and the garbage collected tasks using non-blocking queues.
In this section we describe work in progress on the design
A prototype implementation shows that Handlers are likely
and implementation of two such constructs, called Handlers
to achieve latencies of a few microseconds on stock hardware
and E-Tasks. Unlike the scoped memory construct of the
and operating systems.
Real-Time Specification for Java [7], these constructs are
By comparison with RTSJ’s scoped memory coupled with
free of run-time exceptions and modularity-violating inter-
NoHeapRealTimeThread’s [7], Handlers are both more reli-
face constraints. They are designed to fit the natural pro-
able, since they throw no run-time memory access errors,
gramming idioms of real-time systems, rather than being
and more efficient, since they require neither run-time checks
designed to avoid garbage collection.
nor run-time scope entry and exit (anecdotal evidence sug-
Taken together, Handlers and E-Tasks are likely to en-
gests that scope entry and exit costs about 16µs).
tirely eliminate the need for low-level mechanisms like RTSJ’s
Handlers are more restrictive than scopes in that they do
scoped and immortal memory, and replace them with safe,
not allow dynamic memory allocation, but less restrictive in
high-level constructs that are compatible with the standard
that they do allow references from the heap. The result is
Java language while requiring minimal changes to the vir-
a more reliable, higher-performance programming construct
tual machine.
that better matches programmer needs.

4.2
E-Tasks
Many real-time systems decompose naturally into a set of
tasks that communicate solely by message passing. Such a
decomposition provides a very high level of determinism and
reliability because each task is purely functional in its inputs
e = at
g'
g'
and the task abstraction matches the sensor-to-actuator con-
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
trol flow of many systems.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e'
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
E-Tasks provide such a task model within Java, adapted
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
from that of Giotto [13], but extending it to allow individ-
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
ually garbage collected tasks and the communication of ar-
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
m
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
bitrarily complex values over ports.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Like Handlers, E-Tasks rely on instantiation-time valida-
t
t'
tion. They share some of the same restrictions, but neither
one is a subset of the other. E-Tasks are more restrictive in
(a) Original E-Task
(b) Garbage-Collected E-Task
that they may not observe any global mutable state, nor may
the heap contain references to objects inside of E-Tasks. On
Figure 5: Trading Space for Time in Task Execution.
the other hand, E-Tasks are less restrictive in that they may
Time is the horizontal axis, space is the vertical axis.
allocate memory and mutate their pointer data structures.
m is the base memory of the task, t is its execution
Unlike Handlers, E-Tasks receive new objects from exter-
time, and a is its allocation rate.
nal sources (via ports). If those sources include types not
previously seen by the E-Task, they could cause previously
un-validated code to be executed in overridden methods. As
can be introduced to trade space for time. Garbage collec-
a result, the E-Task validator must check not only its given
tion is triggered by setting the limit of the private heap to
task, but also ensure that any changes to the call graphs of
some value between m and s; this is called its space-out.
preexisting tasks are benign.
The ability to make these space/time tradeoffs introduces
E-Tasks and their ports also implement the logical execu-
the potential for sophisticated scheduling algorithms that
tion time abstraction of Giotto, which provides platform-
consider not only time, but also space. Although they would,
independent programming of the timing behavior of the
in general, be too expensive to perform online, it may be pos-
task. The result is an extension of Java’s principle of “write
sible to compute them in advance and validate them with a
once, run anywhere” from the functional domain to the tim-
witness in the manner of schedule-carrying code [14].
ing domain.
As Sagonas and Wilhelmsson have discussed in the con-
The logical execution time (LET) of a task determines
text of Erlang (which has a similar task model) there are
how long the task executes in real time to produce its re-
tradeoffs between using a global heap in which read-only
sult, independently of any platform characteristics such as
objects are shared between tasks, private heaps which are
system utilization and CPU speed. Then, we check statically
individually collected, and a hybrid of the two [22]. E-Tasks
and dynamically if the task does indeed execute within its
present the same implementation choices; we concentrate on
LET for a given platform, e.g., characterized by a scheduling
the use of private heaps for the level of accountability they
strategy and WCETs. If the task needs less time, its output
provide, but there is nothing about the design that precludes
is buffered and only made available exactly when its LET
the other approaches.
has elapsed. As a result, the behavior of the system is both
One of the major problems with the message-based model
platform independent (assuming sufficient resources to com-
of E-Tasks is that it appears to require that data structures
plete on time) and composable (since two LET-based sets
sent on ports be immutable; otherwise a data structure read
of tasks can be composed without changing their external
from the same port by two tasks, or sent on a port and
behavior).
(perhaps partially) retained by reference in the task, would
As we illustrated in Figure 3, the Metronome real-time
be subject to mutation that could cause side effects.
collector allows a tradeoff between space and time based on
Such a restriction is undesirable because it would signif-
the available CPU and memory resources. With E-Tasks,
icantly limit the flexibility of the data structures and the
we can extend this notion to a finer granularity.
ability to use pre-existing libraries to create and process
This is shown in Figure 5(a): we start by considering the
them. However, it is not the mutability of the data structure
execution of the E-Task in the absence of garbage collection.
that is the fundamental problem, but rather the potential
It runs for time t, has a base memory m of permanent data
for sharing.
structures, and allocates memory at rate a. As a result the
We solve this problem with send-by-collection: at the end
extra memory allocated by the task is e = at, and the total
of the task execution, a specialized garbage collector copies
space required for the task is s = m + at.
the objects in ports from the sender to the receiver. In the
However, if we wish to reduce the task’s memory consump-
event that it knows, either statically or dynamically, that
tion we can interpose intermediate garbage collections which
there is no sharing, it may be possible to optimize that op-
will temporarily reduce the memory utilization back to m.
eration. However, if there is sharing, then multiple receivers
As a result, the task will require additional time (g per
will each receive their own copy of the mutable data, and
collection) but consume less space, as shown in Figure 5(b).
there will be no side effects.
The difference between a task’s WCET and its deadline is
In addition to allowing the use of mutable data structures,
called its slack. By analogy, we call the difference between
send-by-collection means that E-Tasks can even make use
a tasks base memory and its allocated memory its slop. De-
of libraries that invoke synchronized methods because all
pending on the amount of slack and slop, garbage collections
data is guaranteed to be task-local and the synchronizations

are therefore guaranteed to be redundant. The specialized
5.2
Trace Visualization: The Tuning Fork
collection simply removes any synchronization state from
The trace facility in the virtual machine provides the raw
the copied objects that are sent to other tasks.
data, but it is also necessary to monitor and analyze that
information. TuningFork is a trace visualization tool that is
built as an Eclipse plug-in. TuningFork itself also exports a
5.
ANALYSIS AND VISUALIZATION
plug-in based architecture, so that the trace format itself is
Complex real-time systems are, well, complex. Therefore,
user-definable.
it is crucial to be able to understand their behavior. The
The various visualizations and analyses are also structured
Metronome system supports this through an efficient, accu-
as plug-ins, allowing a great deal of flexibility. We are in the
rate trace facility and a visualizer called TuningFork.
process of converting our off-line statistical analysis tools
to the TuningFork architecture, and are investigating the
5.1
Trace Generation
possibility of allowing the same trace analysis plug-ins to
Although tracing tools are commonplace (albeit not as
be run inside of the virtual machine that is gathering the
common as they should be), the generation and analysis of
traces, so that it can be self-monitoring.
traces in a real-time system poses certain challenges.
A screen shot of TuningFork is shown in Figure 6. It
Beginning with the earliest versions of the system we in-
provides both time-series and histogram views of data, as
corporated a cycle-accurate trace facility into the virtual
well as a view of the ring buffer which contains chunks of
machine. Trace events are usually fixed-size 16-byte records
data from the different streams that make up the trace.
consisting of an 8-byte timestamp (cycle count) and an 8-
In the future, we plan to add an oscilloscope-style view for
byte data field.
visualizing very high-frequency events, and a heap density
The trace facility is designed to be efficient enough to be
view in the style of GCspy [21].
run in “always on” mode, although command line options
As on the producer side, TuningFork must also be struc-
are provided to disable it, and a build-time option allows us
tured as a fault-tolerant real-time system. It must itself
to generate a virtual machine that does not even contain the
be prepared to discard buffers and to analyze data that is
extra conditional tests.
incomplete or not well-ordered.
The trace subsystem must be able to execute without in-
In order to provide a unified view of the information from
terfering with the real-time behavior of the rest of the sys-
different parts of the system, the various streams (from dif-
tem. It therefore itself takes on many of the properties of a
ferent CPU’s, threads, etc.) are merged into a single, sorted
real-time system. Its operations must be bounded and lock-
logical stream. There is a trade-off between the complete-
free. Therefore it may have to abandon buffers when the
ness of the stream and the timeliness with which it can be
filesystem or socket is not draining the data quickly enough,
delivered; the system uses a paramterizable time window
or when the virtual machine is producing trace events too
within which it must receive data from all streams before
quickly.
sorting and producing the result. This means that data ar-
When the system is extended to multiprocessors, the com-
riving after the window will be discarded. In Figure 6, this is
plexity of tracing increases considerably. First of all, most
shown in the ring-buffer diagram in TuningFork’s lower left
systems (with the exception of the IBM 390 [15]) do not have
pane, where the darkened set of buffers forms the “merge
a globally synchronized high-resolution clock. In fact, there
window”.
is both skew and drift. On a single board this is relatively
Essentially, delay and loss are simply two sides of the same
low, since the processors typically share a single oscillator.
coin. They also occur at both the beginning and end of the
However, across boards the effect can become considerable.
merged in-memory events that can displayed by the visual-
Variations are caused by both static phenomena such as the
izer, since when old buffers are discarded the data is lost in
use of different chips, and dynamic phenomena such as tem-
an order different from the sorted order.
perature variation.
While we perform a clock synchronization at startup and
periodically piggyback clock synchronization on other syn-
6.
PROBABILISTIC WCET
chronization events, there is always some level of uncertainty
In all of our work on enabling the use of Java for pro-
in the measurements.
gramming complex real-time systems, a recurring theme is
As a result, any trace analysis or visualization facility
the fact that various assumptions are being made about the
must be prepared to deal with both incomplete and inac-
average behavior of the system.
curate data.
This also places a premium on trace data
This is true not only in the use of the long-term average
that avoid dependence on previous entries. For instance,
allocation rate to derive a bound for real-time garbage col-
we initially recorded allocation and freeing with incremen-
lection, but is also implicit in the use of modern hardware
tal amounts, but this led to incorrect results if even one
with its numerous sources of non-determinism and unpre-
record was lost, so we changed them to use absolute num-
dictability due to such things as caches, branch prediction,
bers. For events that are of necessity stateful, we include
hyperthreading, and so on.
sequence numbers to allow the detection of dropped events
Traditional approaches to the design of real-time systems
and reconstruction of partial or estimated results.
seek to maximize determinism to provide firm guarantees
The system is currently being extended to allow user-
that tasks will complete within their deadlines. This ap-
defined events and their insertion from Java code. This will
proach is enshrined in the concept of worst-case execution
allow users of the system to correlate application-level and
time (WCET).
virtual machine events. With suitable kernel extensions op-
However, WCET analysis typically has to make many
erating system events can be incorporated as well, allowing
conservative assumptions, which leads to significant over-
complete vertical profiling.
provisioning. With the current differential between cache

Figure 6: Screenshot of the Tuning Fork tool.
and main memory access, the level of over-provisioning is
the amount of over-provisioning required to reach extremely
often so high as to be useless.
high levels of confidence is surprisingly small. Our approach
WCET also drives a design methodology, in both hard-
is to replace a deterministic WCET with a bound whose
ware and software, that places a strong emphasis on deter-
probability of failure is below the mean time between fail-
ministic execution time of operations rather than on opti-
ure (MTBF) of the physical components in which the real-
mization for average-case execution time as is done in other
time software is embedded. We call this Probabilistically
branches of computer science.
Analyzed WCET, or PAWCET.
With the new generation of complex real-time systems,
Our approach becomes more and more attractive as the
the amount of variability increases and the WCET method-
number of non-deterministic operations increases. Thus it is
ology does not scale up to the resulting levels of complexity.
well-suited for example to tasks with a 10 millisecond dead-
We are investigating a different approach: real-time sys-
line running on a 1 GHz processor, where 10 milliseconds
tems are composed of components in which there are many
might comprise 10, 000, 000 instructions, 2, 000, 000 memory
sources of nondeterminism. In fact, assuming that they are
accesses, 5000 dynamic memory allocations, and so on. On
statistically independent, it is actually better to have more
the other hand, it is not so well suited for a 1 microsecond
sources of nondeterminism rather than less. Our approach
regime, or for a 10 millisecond regime on an 8-bit microcon-
shares some features with that of Bernat et al [5], which
troller running at 1 MHz.
applied such techniques to basic block profiles.
However, more operation-rich regimes are exactly those to
By using many statistically independent sources of non-
which we wish to apply our methodology, since very short-
determinism, we can analytically determine the amount of
running programs are by their nature simpler to analyze us-
over-provisioning required to meet an arbitrary level of con-
ing deterministic approaches (such as the technique of Fer-
fidence that the task will complete within the given time.
dinand et al [12], which can tightly bound the costs due
Since the variance of a single type of event drops as the
to cache, branch prediction, etc. for programs restricted to
total number of events becomes large, and the variance of
bounded loops over static data structures).
the composition of independent events drops super-linearly,
The probabilities can be analyzed using relatively recent

results in the statistics of large deviations [10]. We have
and principled probabilistic analysis allows the complexity
used these techniques to derive confidence formulae, and
to be controlled.
for determining the range of applicability of the formulae.
We believe that these and other advances will lead to the
In particular, for formulae that apply to a large number of
widespread adoption of Java for real-time programming in
events, we quantify what constitutes a “large” number. This
the coming years.
number will be the inflection point below which traditional
deterministic WCET methods must be used.
Acknowledgments
Our approach provides both a design methodology and
Much of this work was done in the IBM J9 virtual machine,
a quantitative method of analysis. The design methodol-
and would not have been possible without the use of that
ogy is to make abundant use of optimizations for improving
infrastructure or the assistance of the J9 team, in particu-
average-case performance, but to limit the variance of the
lar Pat Dubroy, Mike Fulton, and Mark Stoodley. We also
worst-case performance, and to maximize the independence
thank Bob Blainey, Tom Henzinger, En-Kuang Lung, and
of the worst-case events from other optimized operations in
Greg Porpora for many useful discussions.
the system.
The PAWCET analysis takes information about the num-
ber of nondeterministic events, their probabilities, and their
8.
REFERENCES
degree of correlation, and provides an execution time esti-
[1] Bacon, D. F., Cheng, P., and Rajan, V. T. Con-
mate that will be met with a given degree of confidence.
trolling fragmentation and space consumption in the
By allowing a much wider scope for optimizations, the
Metronome, a real-time garbage collector for Java. In
tasks will execute more quickly, which in itself makes it more
Proceedings of the Conference on Languages, Compil-
likely that they will be able to meet their deadlines. Even
ers, and Tools for Embedded Systems (San Diego, Cal-
more importantly it allows programmers to write simpler
ifornia, June 2003). SIGPLAN Notices, 38, 7, 81–92.
code, which will result in a corresponding increase in relia-
bility.
[2] Bacon, D. F., Cheng, P., and Rajan, V. T. A
Of course, the Achilles’ heel of any statistical approach
real-time garbage collector with low overhead and con-
is unexpected correlation. PAWCET is only as good as the
sistent utilization. In Proceedings of the 30th Annual
probability estimates upon which it is based. Thus another
ACM SIGPLAN-SIGACT Symposium on Principles of
design principal is that systems should be designed to be
Programming Languages (New Orleans, Louisiana, Jan.
resilient to correlation. For example, set-associative caches
2003). SIGPLAN Notices, 38, 1, 285–298.
drastically reduce the execution time variance due to long-
[3] Baker, H. G. List processing in real-time on a serial
term correlations in cache misses.
computer. Commun. ACM 21, 4 (Apr. 1978), 280–294.
While probabilistic techniques have their limitations, as
[4] Ben-Ari, M. Algorithms for on-the-fly garbage collec-
real-time systems increase in complexity there will be no
tion. ACM Trans. Program. Lang. Syst. 6, 3 (1984),
other viable approach. In the long term, we expect the
333–344.
methodology used to validate complex real-time systems
[5] Bernat, G., Colin, A., and Petters, S. M. WCET
will combine static analysis, measurement, and probabilistic
analysis of probabilistic hard real-time system. In IEEE
analysis.
Real-Time Systems Symposium (2002), pp. 279–288.
[6] Blelloch, G. E., and Cheng, P. On bounding time
7.
CONCLUSIONS
and space for multiprocessor garbage collection. In
Proc. of the ACM SIGPLAN Conference on Program-
The growth in complexity of real-time systems has caused
ming Language Design and Implementation (Atlanta,
existing methodologies based around small, simple systems
Georgia, June 1999). SIGPLAN Notices, 34, 5, 104–
with totally deterministic behavior to break down. The sit-
117.
uation is similar to that faced by hardware designers with
the advent of VLSI some twenty-five years ago.
[7] Bollella, G., Gosling, J., Brosgol, B. M., Dib-
Spurred by the advent of real-time garbage collection, in
ble, P., Furr, S., Hardin, D., and Turnbull, M.
conjunction with static compilation and the scheduling fa-
The Real-Time Specification for Java. The Java Series.
cilities of RTSJ, Java has reached a critical inflection point
Addison-Wesley, 2000.
in its usability and credibility for the construction of large,
[8] Brooks, R. A. Trading data space for reduced time
complex real-time systems.
and code space in real-time garbage collection on stock
Continuing reduction in worst-case latency of garbage col-
hardware. In Conference Record of the 1984 ACM Sym-
lection, coupled with increased scalability for multiproces-
posium on Lisp and Functional Programming (Austin,
sors, will cause the domain of applications not amenable to
Texas, Aug. 1984), G. L. Steele, Ed., pp. 256–262.
garbage collection to grow ever smaller.
[9] Cheng, P., and Blelloch, G. A parallel, real-time
For those applications with the shortest and most critical
garbage collector. In Proc. of the SIGPLAN Conference
timing constraints, specialized constructs such as Handlers
on Programming Language Design and Implementation
and E-Tasks promise to provide extremely low latency and
(Snowbird, Utah, June 2001). SIGPLAN Notices, 36, 5
very high predictability, while maintaining Java’s high level
(May), 125–136.
of abstraction and its strong guarantees of safety and secu-
[10] Dembo, A., and Zeitouni, O. Large Deviations:
rity.
Techniques and Applications, second ed., vol. 38 of
The complexity of the systems involved means that abso-
Stochastic Modelling and Applied Probability. Springer-
lute determinism is precluded by the undecidability of the
Verlag, 1998.
analysis problems. Providing sophisticated tools for analy-
[11] Dijkstra, E. W., Lamport, L., Martin, A. J.,
sis and visualization allows the complexity to be understood,
Scholten, C. S., and Steffens, E. F. M. On-the-fly

garbage collection: an exercise in cooperation. Com-
[20] Pixley, C. An incremental garbage collection algo-
mun. ACM 21, 11 (1978), 966–975.
rithm for multi-mutator systems. Distributed Comput-
[12] Ferdinand, C., Heckmann, R., Langenbach, M.,
ing 6, 3 (Dec. 1988), 41–49.
Martin, F., Schmidt, M., Theiling, H., Thesing,
[21] Printezis, T., and Jones, R. GCspy: an adaptable
S., and Wilhelm, R. Reliable and precise WCET de-
heap visualisation framework. In Proc. of the ACM
termination for a real-life processor. In Proc. of the
SIGPLAN Conference on Object-oriented Program-
First International Workshop on Embedded Software
ming, Systems, Languages, and Applications (Seattle,
(Tahoe City, California, Oct. 2001), T. A. Henzinger
Washington, 2002), pp. 343–358.
and C. M. Kirsch, Eds., vol. 2211 of Lecture Notes in
[22] Sagonas,
K.,
and
Wilhelmsson,
J.
Message
Computer Science, pp. 469–485.
analysis-guided allocation and low-pause incremental
[13] Henzinger, T. A., Kirsch, C. M., and Horowitz,
garbage collection in a concurrent language. In Proceed-
B. Giotto: A time-triggered language for embedded
ings of the Fourth International Symposium on Mem-
programming. Proceedings of the IEEE 91, 1 (Jan.
ory Management (Vancouver, British Columbia, 2004),
2003), 84–99.
pp. 1–12.
[14] Henzinger, T. A., Kirsch, C. M., and Matic, S.
[23] Steele, G. L. Multiprocessing compactifying garbage
Schedule-carrying code. In Proc. of the Third Inter-
collection. Commun. ACM 18, 9 (Sept. 1975), 495–508.
national Conference on Embedded Software (Philadel-
[24] Steele, G. L. Corrigendum: Multiprocessing com-
phia, Pennsylvania, Oct. 2003), R. Alur and I. Lee,
pactifying garbage collection. Commun. ACM 19, 6
Eds., vol. 2855 of Lecture Notes in Computer Science,
(June 1976), 354.
pp. 241–256.
[25] Vechev, M. T., Bacon, D. F., Cheng, P., and
[15] IBM Corporation. Enterprise Systems Architec-
Grove, D. Derivation and evaluation of concurrent
ture/390 Principles of Operation, ninth ed., June 2003.
collectors. In Proceedings of the Nineteenth European
[16] Jikes
Research
Virtual
Machine
(RVM).
Conference on Object-Oriented Programming (Glas-
http://jikesrvm.sourceforge.net.
gow, Scotland, July 2005), A. Black, Ed., Lecture Notes
[17] Lamport, L. Garbage collection with multiple pro-
in Computer Science, Springer-Verlag.
cesses: an exercise in parallelism. In Proc. of the 1976
[26] Vechev, M. T., Yahav, E., and Bacon, D. F. Para-
International Conference on Parallel Processing (1976),
metric generation of concurrent collection algorithms.
pp. 50–54.
Submitted for publication, July 2005.
[18] Lee, E. A. What’s ahead for embedded software?
[27] Yuasa, T. Real-time garbage collection on general-
Computer 33, 9 (2000), 18–26.
purpose machines. Journal of Systems and Software 11,
[19] Mann, T., Deters, M., LeGrand, R., and Cytron,
3 (Mar. 1990), 181–198.
R. K. Static determination of allocation rates to sup-
port real-time garbage collection. In Proc. of the ACM
SIGPLAN/SIGBED conference on Languages, Compil-
ers, and Tools for Embedded Systems (Chicago, Illinois,
2005), pp. 193–202.