Original PDF Flash format optimizing-the-migration-of-virtual-computers  


Optimizing The Migration Of Virtual Computers

Optimizing the Migration of Virtual Computers
Constantine P. Sapuntzakis
Ramesh Chandra
Ben Pfaff
Jim Chow
Monica S. Lam
Mendel Rosenblum
Computer Science Department
Stanford University
{csapuntz, rameshch, blp, jimchow, lam, mendel}@stanford.edu
“Beam the computer up, Scotty!”
Abstract
1
Introduction
This paper shows how to quickly move the state of a run-
Today’s computing environments are hard to maintain
ning computer across a network, including the state in its
and hard to move between machines. These environ-
disks, memory, CPU registers, and I/O devices. We call
ments encompass much state, including an operating sys-
this state a capsule. Capsule state is hardware state, so it
tem, installed software applications, a user’s individual
includes the entire operating system as well as applica-
data and profile, and, if the user is logged in, a set of pro-
tions and running processes.
cesses. Most of this state is deeply coupled to the com-
We have chosen to move x86 computer states because
puter hardware. Though a user’s data and profile may be
x86 computers are common, cheap, run the software we
mounted from a network file server, the operating sys-
use, and have tools for migration. Unfortunately, x86
tem and applications are often installed on storage local
capsules can be large, containing hundreds of megabytes
to the computer and therefore tied to that computer. Pro-
of memory and gigabytes of disk data. We have devel-
cesses are tied even more tightly to the computer; very
oped techniques to reduce the amount of data sent over
few systems support process migration. As a result, users
the network: copy-on-write disks track just the updates
cannot move between computers and resume work unin-
to capsule disks, “ballooning” zeros unused memory, de-
terrupted. System administration is also more difficult.
mand paging fetches only needed blocks, and hashing
Operating systems and applications are hard to maintain.
avoids sending blocks that already exist at the remote
Machines whose configurations are meant to be the same
end. We demonstrate these optimizations in a prototype
drift apart as different sets of patches, updates, and in-
system that uses VMware GSX Server virtual machine
stalls are applied in different orders.
monitor to create and run x86 capsules. The system tar-
We chose to investigate whether issues including user
gets networks as slow as 384 kbps.
mobility and system administration can be addressed by
Our experimental results suggest that efficient capsule
encapsulating the state of computing environments as
migration can improve user mobility and system man-
first-class objects that can be named, moved, and oth-
agement. Software updates or installations on a set of
erwise manipulated. We define a capsule for a machine
machines can be accomplished simply by distributing a
architecture as the data type encapsulating the complete
capsule with the new changes. Assuming the presence of
state of a (running) machine, including its operating sys-
a prior capsule, the amount of traffic incurred is commen-
tem, applications, data, and possibly processes. Capsules
surate with the size of the update or installation package
can be bound to any instance of the architecture and be
itself. Capsule migration makes it possible for machines
allowed to resume; similarly, they can be suspended from
to start running an application within 20 minutes on a
execution and serialized.
384 kbps link, without having to first install the applica-
A computer architecture need not be implemented in
tion or even the underlying operating system. Further-
hardware directly; it can be implemented in software us-
more, users’ capsules can be migrated during a commute
ing virtual machine technology[12]. The latter option is
between home and work in even less time.
particularly attractive because it is easier to extract the
state of a virtual computer. Virtual computer states are

themselves sometimes referred to as “virtual machines.”
availability of similar capsules on local machines.
We introduce the term “capsule” to distinguish the con-
To speed up the transfer of capsules and reduce the start-
tents of a machine state as a data type from the machinery
up times on slow networks, our system works as follows:
that can execute machine code. After all, we could bind
these machine states to real hardware and not use virtual
1. Every time we start a capsule, we save all the up-
machines at all.
dates made to disk on a separate disk, using copy-
To run existing software, we chose the standard x86
on-write. Thus, a snapshot of an execution can be
architecture[11, 32] as the platform for our investigation.
represented with an incremental cost commensurate
This architecture runs the majority of operating systems
with the magnitude of the updates performed.
and software programs in use today. In addition, com-
2. Before a capsule is serialized, we reduce the mem-
mercial x86 virtual machine monitors are available, such
ory state of the machine by flushing non-essential
as VMware GSX Server (VMM)[28] and Connectix Vir-
data to disk. This is done by running a user “bal-
tual PC[7], that can run multiple virtual x86 machines on
loon” process that acquires memory from the op-
the same hardware. They already provide the basic func-
erating system and zeros the data. The remaining
tions of writing out the state of a virtual x86 machine,
subset of memory is transferred to the destination
binding the serialized state onto a virtual machine, and
machine and the capsule is started.
resuming execution.
3. Instead of sending the entire disk, disk pages are
The overall goal of our research project is to explore the
fetched on demand as the capsule runs, taking full
design of a capsule-based system architecture, named the
advantage of the operating system’s ability to toler-
Collective, and examine its potential to provide user mo-
ate disk fetch latencies.
bility, recovery, and simpler system management. Com-
4. Collision-resistant hashes are used to avoid sending
puters and storage in the Collective system act as caches
pages of memory or disk data that already exist at
of capsules. As users travel, the Collective can move
the destination. All network traffic is compressed
their capsules to computers close to them, giving users a
with gzip[8].
consistent environment. Capsules could be moved with
users as they commute between home and work. Cap-
We have implemented all the optimizations described in
sules can be duplicated, distributed to many different ma-
this paper in a basic prototype of our Collective sys-
chines, and updated like any other data; this can form the
tem. Our prototype’s platform uses VMware GSX Server
basis for administering a group of computers. Finally,
2.0.1 running on Red Hat Linux 7.3 (kernel 2.4.18-10) to
capsules can be moved among machines to balance loads
execute x86 capsules. Users can retrieve their capsules
or for fail-over.
by name, move capsules onto a file system, start cap-
sules on a computer, and save capsules to a file system.
We have run both Linux and Windows in our capsules.
1.1
Storing and Migrating Capsules
Our results show that we can move capsules in 20 min-
Many challenges must be addressed to realize our goals
utes or less across 384 kbps DSL, fast enough to move
of the Collective project, but this paper focuses on one
users’ capsules between home and work as they com-
simple but crucial one: can we afford the time and space
mute. Speed improves when an older version of the cap-
to store, manipulate and migrate x86 capsules? x86 cap-
sule is available at the destination. For software dis-
sules can be very large. An inactive capsule can con-
tribution, we show that our system sends roughly the
tain gigabytes of disk storage, whereas an active capsule
same amount of data as the software installer package for
can include hundreds of megabytes of memory data, as
newly installed software, and often less for upgrades to
well as internal machine registers and I/O device states.
already installed software. The results suggest that cap-
Copying a gigabyte capsule over a standard 384 kbps
sule migration offers a new way to use software where
DSL link would take 6 hours!
Clearly, a straightfor-
machines can start running a new application within a
ward implementation that copies the entire capsule be-
few minutes, with no need to first install the application
fore starting its computation would take too long.
or even its underlying operating system.
We have developed a number of optimizations that re-
1.2
Paper Organization
duce capsules’ storage requirements, transfer time and
start-up time over a network. These techniques are invis-
Section 2 describes how we use a virtual machine moni-
ible to the users, and do not require any modifications to
tor to create and resume capsules. Section 3 motivates
the operating system or the applications running inside
the need for optimizations by discussing the intended
it. Our techniques target DSL speeds to support capsule
uses of capsules. Section 4 discusses the optimizations
migration to and from the home, taking advantage of the
we use to reduce the cost of capsules. In Section 5 we

describe some experiments we performed on a prototype
3
Usages and Requirements
of our system. The paper discusses related work in Sec-
tion 6 and concludes in Section 7.
The Collective system uses serialization and mobility of
capsules to provide user mobility, backup, software man-
agement and hardware management. We describe each
of these applications of capsules and explain their re-
2
Virtual Machine Monitors
quirements on capsule storage and migration.
3.1
User Mobility
A virtual machine monitor is a layer of software that
sits directly on the raw hardware and exports a virtual
Since capsules are not tied to a particular machine, they
machine abstraction that imitates the real machine well
can follow users wherever they go. Suppose a user wants
enough that software developed for the real machine also
to work from home on evenings and weekends. The user
runs in the virtual machine. We use an x86 virtual ma-
has a single active work capsule that migrates between
chine monitor, VMware GSX Server, to generate, serial-
a computer at home and one at work. In this way, the
ize, and execute our x86 capsules.
user can resume work exactly where he or she left off,
Virtual machine monitors have several properties that
similar to the convenience provided by carrying a lap-
make them ideal platforms for supporting capsules. The
top. Here, we assume standard home and office work-
monitor layer encapsulates all of the machine state nec-
loads, like software engineering, document creation, web
essary to run software and mediates all interactions be-
browsing, e-mail, and calendar access. The system may
tween software and the real hardware. This encapsula-
not work well with data-intensive applications, such as
tion allows the monitor to suspend and disconnect the
video editing or database accesses.
software and virtual device state from the real hardware
To support working from home, our system must work
and write that machine state to a stream. Similarly, the
well at DSL or cable speeds. We would like our users
monitor can also bind a machine state to the real hard-
to feel that they have instantaneous access to their ac-
ware and resume its execution. The monitor requires no
tive environments everywhere. It is possible to start up
cooperation from the software running on the monitor.
a capsule without having to entirely transfer it; after all,
Migration is made more difficult by the myriad of hard-
a user does not need all the data in the capsule immedi-
ware device interfaces out there. GSX Server simplifies
ately. However, we also need to ensure that the capsule is
migration by providing the same device interfaces to the
responsive when it comes up. It would frustrate a user to
virtual machine regardless of the underlying hardware;
get a screen quickly but to find each keystroke and mouse
virtualization again makes this possible. For example,
click processed at glacial speed.
GSX Server exports a Bus Logic SCSI adapter and AMD
Fortunately, in this scenario, most of the state of a user’s
Lance Ethernet controller to the virtual machine, inde-
active capsule is already present at both home and work,
pendent of the actual interface of disk controller or net-
so only the differences in state need to be transferred dur-
work adapter. GSX in turn runs on a host operating sys-
ing migration. Furthermore, since a user can easily ini-
tem, currently Linux or Windows, and implements the
tiate the capsule migration before the commute, the user
virtual devices using the host OS’s devices and files.
will not notice the migration delay as long as the capsule
Virtual hard disks are especially powerful. The disks can
is immediately available after the commute.
be backed not just by raw disk devices but by files in the
host OS’s file system. The file system’s abilities to easily
3.2
Backups
name, create, grow, and shrink storage greatly simplify
the management of virtual hard disks.
Because capsules can be serialized, users and system ad-
ministrators can save snapshots of their capsules as back-
Still, some I/O devices need more than simple conver-
ups. A user may choose to checkpoint at regular intervals
sion routines to work. For example, moving a capsule
or just before performing dangerous operations. It is pro-
that is using a virtual network card to communicate over
hibitively expensive to write out gigabytes to disk each
the Internet is not handled by simply remapping the de-
time a version is saved. Again, we can optimize the stor-
vice to use the new computer’s network card. The new
age by only recording the differences between successive
network card may be on a network that is not able to
versions of a capsule.
receive packets for the capsule’s IP address. However,
since the virtualization layer can interpose on all I/O, it
3.3
System Management
can, transparent to the capsule, tunnel network packets to
and from the capsule’s old network over a virtual private
Capsules can ease the burden of managing software and
network (VPN).
hardware. System administrators can install and main-

tain the same set of software on multiple machines by
shots from the same execution or a series of software up-
simply creating one (inactive) capsule and distributing it
grades, are expected to be found on machines in a Col-
to all the machines. This approach allows the cost of sys-
lective system. Ideally, the cost of storing or transferring
tem administration to be amortized over machines run-
a capsule, given a similar version of the capsule, should
ning the same configuration.
be proportional to the size of the difference between the
This approach shares some similarities with the concept
two. Also, we observe that the two largest components
of disk imaging, where local disks of new machines are
in a capsule, the memory and the disk, are members of
given some standard pre-installed configuration. Disk
the memory hierarchy in a computer, and as such, many
imaging allows each machine to have only one config-
pre-existing management techniques can be leveraged.
uration. On the other hand, our system allows multiple
Specifically, we have developed the following four opti-
capsules to co-exist on the same machine. This has a
mizations:
few advantages: It allows multiple users with different
requirements to use the same machine, e.g. machines in
1. Reduce the memory state before serialization.
a classroom may contain different capsules for different
2. Reduce the incremental cost of saving a capsule
classes. Also, users can use the same machine to run dif-
disk by capturing only the differences.
ferent capsules for different tasks. They can have a single
3. Reduce the start-up time by paging disk data on de-
customized capsule each for personal use, and multiple
mand.
work capsules which are centrally updated by system ad-
4. Decrease the transfer time by not sending data
ministrators. The capsule technique also causes less dis-
blocks that already exist on both sides.
ruption since old capsules need not be shut down as new
capsules get deployed.
4.1
Ballooning
Moving the first capsule to a machine over the network
can be costly, but may still be faster and less laborious
Today’s computers may contain hundreds of megabytes
than downloading and installing software from scratch.
of memory, which can take a while to transfer on a DSL
Moving subsequent capsules to machines that hold other
link. One possibility to reduce the start-up time is to fetch
capsules would be faster, if there happen to be similari-
the memory pages as they are needed. However, operat-
ties between capsules. In particular, updates of capsules
ing systems are not designed for slow memory accesses;
naturally share much in common with the original ver-
such an approach would render the capsule unresponsive
sion.
at the beginning. The other possibility is to flush non-
We can also take advantage of the mobility of capsules
essential data out of memory, transfer a smaller working
to simplify hardware resource management. Rather than
set, and page in the rest of the data as needed.
having the software tied to the hardware, we can select
We observe that clever algorithms that eliminate or page
computing hardware based on availability, load, location,
out the less useful data in a system have already been im-
and other factors. In tightly connected clusters, this mo-
plemented in the OS’s virtual memory manager. Instead
bility allows for load balancing. Also, migration allows
of modifying the OS, which would require an enormous
a machine to be taken down without stopping services.
amount of effort per operating system, we have chosen
On an Internet scale, migration can be used to move ap-
to use a gray-box approach[2] on this problem. We trick
plications to servers that are closer to the clients[3].
the OS into reclaiming physical memory from existing
processes by running a balloon program that asks the OS
3.4
Summary
for a large number of physical pages. The program then
zeros the pages, making them easily compressible. We
The use of capsules to support user mobility, backup,
call this process “ballooning,” following the term intro-
and system management depends on our ability to both
duced by Waldspurger[29] in his work on VMware ESX
migrate capsules between machines and store them effi-
server. While the ESX server uses ballooning to return
ciently. It is desirable that our system works well at DSL
memory to the monitor, our work uses ballooning to zero
speed to allow capsules be migrated to and from homes.
out memory for compression.
Furthermore, start-up delays after migration should be
minimized while ensuring that the migrated capsules re-
Ballooning reduces the size of the compressed memory
main responsive.
state and thus reduces the start-up time of capsules. This
technique works especially well if the memory has many
4
Optimizations
freed pages whose contents are not compressible. There
is no reason to transfer such data, and these pages are the
Our optimizations are designed to exploit the property
first to be cleared by the ballooning process. Discard-
that similar capsules, such as those representing snap-
ing pages holding cached data, dirty buffers and active

data, however, may have a negative effect. If these pages
University
Capsule Disk
are immediately used, they will need to be fetched on
Department1
Department2
demand over the network. Thus, even though a capsule
Latest Disk
Latest Disk
may start earlier, the system may be sluggish initially.
Student1 capsule
Department2
snapshot
updated capsule
We have implemented ballooning in both the Linux and
Latest Disk
Student2 capsule
Student3 capsule
Latest Disk
Latest Disk
Latest Disk
Windows 2000 operating systems. The actual imple-
mentation of the ballooning process depends on the OS.
Student4 capsule
Student1 capsule
Latest Disk
Windows 2000 uses a local page replacement algorithm,
Latest Disk
which imposes a minimum and maximum working set
Figure 1: An example capsule hierarchy.
size for each process. To be most effective, the Windows
2000 balloon program must ensure its current working
set size is set to this maximum.
trating how it may be used in a university. The root cap-
Since Linux uses a global page replacement algorithm,
sule contains all the software available to all students.
with no hard limits on the memory usage of processes,
The various departments in the university may choose to
a simple program that allocates and zeros pages is suf-
extend the basic capsules with department-specific soft-
ficient. However, the Linux balloon program must de-
ware. The department administrator can update the de-
cide when to stop allocating more memory, since Linux
partment capsule by deriving new child capsules. Stu-
does not define memory usage limits as Windows does.
dents’ files are assumed to be stored on networked stor-
For our tests, the Linux balloon program adopts a simple
age servers; students may use different capsules for dif-
heuristic that stops memory allocation when free swap
ferent courses; power users are likely to maintain their
space decreases by more than 1MB.
own private capsule for personal use. Each time a user
logs in, he looks up the latest department capsule and de-
Both ballooning programs explicitly write some zeros to
rives his own individual capsule. The capsule migrates
each allocated page so as to stop both OSes from map-
with the student as he commutes, and is destroyed when
ping the allocate pages to a single zero copy-on-write
he logs out. Note that if a capsule disk is updated, all
page. In addition, both programs hook into the OS’s
the derived capsules containing custom installed soft-
power management support, invoking ballooning when-
ware have to be ported to the updated capsule. For exam-
ever the OS receives a suspend request from the VMM.
ple, if the University capsule disk is updated, then each
department needs to re-create its departmental capsule.
4.2
Capsule Hierarchies
Capsule hierarchies have several advantages: During the
Capsules in the Collective system are seldom created
migration of a capsule disk, only COW disks that are not
from scratch, but are mostly derived from other capsules
already present at the destination need to be transferred.
as explained in Section 3. The differences between re-
Capsule hierarchies allow efficient usage of disk space
lated capsules are small relative to the total size of the
by sharing common data among different capsule disks.
capsules. We can store the disks in these capsules ef-
This also translates to efficient usage of the buffer cache
ficiently by creating a hierarchy, where each child cap-
of the host OS, when multiple capsules sharing COW
sule could be viewed as inheriting from the parent cap-
disks simultaneously execute on the same host. And fi-
sule with the differences in disk state between parent and
nally, creating a new capsule using COW disks is much
child captured in a separate copy-on-write (COW) virtual
faster than copying entire disks.
disk.
Each COW disk is implemented as a bitmap file and a
At the root of the hierarchy is a root disk, which is a
sequence of extent files. An extent file is a sequence of
complete capsule disk. All other nodes represent a COW
blocks of the COW disk, and is at most 2 GB in size
disk. Each path of COW disks originating from the root
(since some file systems such as NFS cannot support
in the capsule hierarchy represents a capsule disk; the
larger files). The bitmap file contains one bit for each
COW disk at the end of the path for a capsule disk is
16 KB block on the disk, indicating whether the block
its latest disk. We cannot directly run a capsule whose
is present in the COW disk. We use sparse file support
latest disk is not a leaf of the hierarchy. We must first
of Linux file systems to efficiently store large yet only
derive a new child capsule by adding a new child disk
partially filled disks.
to the latest disk and all updates are made to the new
Writes to a capsule disk are performed by writing the
disk. Thus, once capsules have children, they become
data to the latest COW disk and updating its bitmap file.
immutable; this property simplifies the caching of cap-
Reads involve searching the latest COW disk and its an-
sules.
cestor disks in turn until the required block is found.
Figure 1 shows an example of a capsule hierarchy illus-
Since the root COW disk contains a copy of all the

blocks, the search is guaranteed to terminate. Figure 2
4.3
Demand Paging of Capsule Disks
shows an example capsule disk and the chain of COW
disks that comprise it. Note that the COW disk hierarchy
To reduce the start-up time of a capsule, the COW disks
is not visible to the VMM, or to the OS and applications
corresponding to a capsule disk are read page-by-page
inside the capsule; all of them see a normal flat disk as
on demand, rather than being pre-fetched. Demand pag-
illustrated in the figure.
ing is useful because COW disks, especially root disks,
could be up to several gigabytes in size and prefetching
The COW disk implementation interfaces with GSX
these large disks could cause an unacceptable start-up
Server through a shim library that sits between GSX
delay. Also, during a typical user session, the working
Server and the C library.
The shim library inter-
set of disk blocks needed is a small fraction of the total
cepts GSX Server’s I/O requests to disk image files in
blocks on the disk, which makes pre-fetching the whole
VMware’s “plain disk” format, and redirects them to
disk unnecessary. Most OSes have been designed to hide
a local disk server. The plain disk format consists of
disk latency and hence can tolerate the latency incurred
the raw disk data laid out in a sequence of extent files.
during demand paging the capsule disk.
The local disk server translates these requests to COW
disk requests, and executes the I/O operations against the
The implementation of the capsule disk system, includ-
COW disks.
ing demand paging, is shown in Figure 3. The shim li-
brary intercepts all of VMware’s accesses to plain disks
Each suspend and resume of an active capsule creates
and forwards them to a disk server on the local machine.
a new active capsule, and adds another COW layer to its
The disk server performs a translation from a plain disk
disks. This could create long COW disk chains. To avoid
access to the corresponding access on the COW disks of
accumulation of costs in storing the intermediate COW
the capsule. Each COW disk can either be local or re-
disks, and the cost of looking up bitmaps, we have imple-
mote. Each remote COW disk has a corresponding local
mented a promote primitive for shortening these chains.
shadow COW disk which contains all the locally cached
We promote a COW disk up one level of the hierarchy by
blocks of the remote COW disk.
adding to the disk all of its parent’s blocks not present in
its own. We can delete a capsule by first promoting all its
Virtual
children and then removing its latest disk. We can also
VMware
Process
Machine
apply the promotion operations in succession to convert
(one per
a COW disk at the bottom of the hierarchy into a root
capsule)
Shim
disk.
C−library
On a final note, VMware GSX Server also implements a
Local RPC
copy-on-write format in addition to its plain disk format.
However, we found it necessary to implement our own
Local Disk Server Process
(one per compute server)
COW format since VMware’s COW format was complex
and not conducive to the implementation of the hashing
Local
Shadow
optimization described later in the paper.
COW Disks
COW Disks
Bitmap
DiskBlocks
Network RPC
Root
11111111
COW Disk
HCP Server
10010101
COW Disk 1
Remote
COW Disks
11000001
COW Disk 2
Latest
01011000
COW Disk
Figure 3: Implementation of capsule disks and demand paging.
Since the latest COW disk is always local, all writes are
Flat Disk
(as seen by the capsule)
local. Reads, on the other hand, could either be local
or remote. In the case of a remote read, the disk server
requests the block from the shadow COW disk. If the
Block present in COW disk
block is not cached locally, it is fetched remotely and
Figure 2: An example capsule disk.
added to the shadow COW.
Starting a capsule on a machine is done as follows: first,

the memory image and all the bitmaps of the COW disks
Our HCP algorithm is intended for migrating capsules
are transferred if they are not available locally. Then
over low bandwidth links such as DSL. Because HCP in-
the capsule is extended with a new, local latest COW
volves many disk seeks, its effective throughput is well
disk. For each remote COW disk, the corresponding
under 10 Mbps.
Hence, it is not intended for high-
shadow COW disk is created if it does not already ex-
bandwidth LAN environments where the network is not
ist. GSX Server can now be invoked on the capsule.
the bottleneck.
Note that since remote COW disks are immutable, the
cached blocks in the shadow COW disks can be re-used
4.4.1
Hash Cache Design
for multiple capsules and across suspends and resumes of
a single capsule. This is useful since no network traffic
HCP uses a hash cache to map hashes to data. Unlike
is incurred for the cached blocks.
rsync, the cache is persistent; HCP does not need to gen-
erate the table by scanning a file or file system on each
The Collective system uses an LDAP directory to keep
transfer, saving time.
track of the hosts on which a COW disk is present.
In general, COW disks of a capsule disk could be dis-
The cache is implemented using a hash table whose size
tributed across many hosts since they were created on
is fixed at creation. We use the first several bits of the
different hosts. However, the disks are also uploaded
hash key to index into the table. File data is not stored in
(in the background) to a central storage server for bet-
the table; instead, each entry has a pointer to a file and
ter availability.
offset. By not duplicating file data, the cache uses less
disk space. Also, the cache can read ahead in the file,
priming an in-memory cache with data blocks. Read-
4.4
Hash-Based Compression
ahead improves performance by avoiding additional disk
accesses when two files contain runs of similar blocks.
We use a fourth technique to speed up data transfer over
low-bandwidth links. Inspired by the low-bandwidth file
Like LBFS, when the cache reads file data referenced by
system (LBFS[19]) and rsync[27], we decrease transfer
the table, it always checks that it matches the 20-byte
time by sending a hash of data blocks instead of the data
SHA-1 hash provided. This maintains integrity and al-
itself. If the receiver can find data on local storage that
lows for a couple of performance improvements. First,
hashes to the same value, it copies the data from local
the cache does not need to be notified of changes to file
storage. Otherwise, the receiver requests the data from
data; instead, it invalidates table entries when the in-
the server. We call this technique HCP, for Hashed Copy.
tegrity check fails. Second, it does not need to lock on
The Collective prototype uses HCP for demand paging
concurrent cache writes, since corrupted entries do not
disks and copying memory and disk images.
affect correctness. Finally, the cache stores only the first
8 bytes of the hash in each table entry, allowing us to
We expect to find identical blocks of data between disk
store more entries.
images and memories, even across different users’ cap-
sules. First, the memory in most systems caches disk
The hash key indexes into a bucket of entries, currently a
blocks. Second, we expect most users in the Collec-
memory page in size. On a lookup, the cache does a lin-
tive to migrate between a couple of locations, e.g. home
ear search of the entries in a bucket to check whether one
and work. After migrating a couple of times, these loca-
of them matches the hash. On a miss, the cache adds the
tions will contain older memory and disk images, which
entry to the bucket, possibly evicting an existing entry.
should contain blocks identical to those in later images,
Each entry contains a use count that the cache increments
since most users will tend to use the same applications
on every hit. When adding an entry to the cache, the hash
day to day. Finally, most users run code that is distributed
cache chooses a fraction of the entries at random from the
in binary form, with most of this binary code copied un-
bucket and replaces the entry with the lowest use count;
modified into memory when the application runs, and the
this evicts the least used and hopefully least useful entry
same binary code (e.g. Microsoft Office or the Netscape
of the group. The entries are chosen at random to de-
web browser) is distributed to millions of people. As a
crease the chance that the same entry will be overwritten
result, we expect to find common blocks even between
by two parallel threads.
different users’ capsules.
4.4.2
Finding Similar Blocks
Like LBFS, HCP uses a strong cryptographic hash, SHA-
1[1]. The probability that two blocks map to the same
For HCP to compress transfers, the sender and receiver
160-bit SHA-1 hash is negligible, less than the error rate
must divide both memory and disk images into blocks
of a TCP connection or memory[5]. Also, malicious par-
that are likely to recur. In addition, when demand pag-
ties cannot practically come up with data that generates
ing, the operating system running inside the capsule es-
the same hash.
sentially divides the disk image by issuing requests for

blocks on the disk. In many systems, the memory page
as possible.
is the unit of disk I/O and memory management, so we
CLIENT
SERVER
chose memory pages as our blocks.
(a)
Lookup
(c)
Read−Hash
The memory page will often be the largest common unit
file handle
CLIENT
SERVER
between different memory images or between memory
hash
and disk. Blocks larger than a page would contain two
CLIENT
SERVER
Read
adjacent pages in physical memory; since virtual mem-
(b)
Read−Hash
ory can and does use adjacent physical pages for com-
data
hash
pletely different objects, there is little reason to believe
that two adjacent pages in one memory image will be
Figure 4: Typical HCP session: (a) session initiation, (b) hash
adjacent in another memory image or even on disk.
cache hit, (c) hash cache miss.
When copying a memory image, we divide the file into
page-sized blocks from the beginning of the image file.
For disk images, it is not effective to naively chop up the
5
Experimental Results
disk into page-size chunks from the start of the disk; file
Our prototype system is based on VMware GSX Server
data on disk is not consistently page aligned. Partitions
2.0.1 running on Red Hat Linux 7.3 (kernel 2.4.18-12).
on x86 architecture disks rarely start on a page bound-
Except for the shim library, we wrote the code in Java
ary. Second, at least one common file system, FAT, does
using Sun’s JDK 1.4.0. The experiments ran on 2.4 GHz
not start its file pages at a page offset from the start of
Pentium 4 machines with 1GB memory.
the partition. To solve this problem, we parse the par-
tition tables and file system superblocks to discover the
A separate computer running FreeBSD and the
alignment of file pages. This information is kept with the
dummynet[23] shaper simulated a 384 kbps symmetric
disk to ensure we request properly aligned file data pages
DSL link with 20 ms round-trip delay. We confirmed
when copying a disk image.
the setup worked by measuring ping times of 20ms and
a TCP data throughput of 360kbps[20]. We checked the
On a related note, the ext2, FAT, and NT file systems
correctness of our HCP implementation by ensuring that
all default to block sizes less than 4 KB when creating
the hash keys generated are evenly distributed.
smaller partitions; as a result, files may not start on page
boundaries. Luckily, the operator can specify a 4 KB or
We configured the virtual machines to have 256 MB
larger block size when creating the file system.
memory and 4 GB local disk. Along with the operating
system and applications, the local disk stored user files.
Since HCP hashes at page granularities, it does not deal
In future versions of the system, we expect that user files
with insertions and deletions well as they may change
will reside on a network file system tolerant of low band-
every page of a file on disk or memory; despite this, HCP
widths.
still finds many similar pages.
To evaluate our system, we performed the following four
4.4.3
HCP Protocol
experiments:
The HCP protocol is very similar to NFS and LBFS. Re-
1. Evaluated the use of migration to propagate soft-
quests to remote storage are done via remote procedure
ware updates.
call (RPC). The server maintains no per-client state at the
2. Evaluated the effectiveness and interaction between
HCP layer, simplifying error recovery.
ballooning, demand paging, and hash compression.
Figure 4 illustrates the protocol structure. Time increases
3. Evaluated the trade-offs between directly using an
down the vertical axis. To begin retrieving a file, an
active capsule versus booting an inactive capsule.
HCP client connects to the appropriate HCP server and
4. Simulated the scenario where users migrate their
retrieves a file handle using the LOOKUP command, as
capsules as they travel between home and work.
shown in part (a). The client uses READ-HASH to ob-
tain hashes for each block of the file in sequence and
5.1
Software Management
looks up all of these hashes in the hash cache. Blocks
found via the hash cache are copied into the output file,
Software upgrades are a common system administration
and no additional request is needed, as shown in part (b).
task. Consider an environment where a collection of ma-
Blocks not cached are read from the server using READ,
chines maintained to run exactly the same software con-
as in part (c). The client keeps a large number of READ-
figuration and users’ files are stored on network storage.
HASH and READ requests outstanding in an attempt to
In a capsule-based system, the administrator can simply
fill the bandwidth between client and server as effectively
distribute an updated capsule to all the machines. In our

system, assuming that the machines already have the pre-
(a) Installations
vious version of the capsule, we only need to send the
3 MB
latest COW disk containing all the changes. Our results
null
make-dic
show that using HCP to transfer the COW disks reduces
gnome-core
2 MB
the transfer amounts to levels competitive or better than
ghc5
wvdial
current software install and update techniques. We con-
zebra
sider three system administration tasks in the following:
gnats-user
1 MB
sbcl
upgrading an operating system, installing software pack-
amaya
ages, and updating software packages.
mega
5.1.1
Operating System Upgrade
10 kB
100 kB
1 MB
10 MB
100 MB
Our first experiment is to measure the amount of traffic
(b) Updates
incurred when updating Red Hat version 7.2 to version
7.3. In this case, the system administrator is likely to
0 MB
start from scratch and create a new root disk, instead of
null
updating version 7.2 and capturing the changes in a COW
zlib
-20 MB
disk. The installation created a 1.6 GB capsule. Hashing
apache
bind
this capsule against a hash cache containing version 7.2
glibc
found 30% of the data to be redundant. With gzip, we
-40 MB
large
only need to transfer 25% of the 1.6 GB capsule.
A full operating system upgrade will be a lengthy opera-
-60 MB
tion regardless of the method of delivery, due to the large
10 kB
100 kB
1 MB
10 MB
100 MB
amount of data that must be transferred across the net-
work. Use of capsules may be an advantage for such up-
Figure 5: Difference in size between the HCP transfer of the
grades because data transfer can take place in the back-
COW disk holding the changes and (a) the installed packages,
(b) the update packages.
ground while the user is using an older version of the
capsule being upgraded (or a completely different cap-
sule).
to an update of 115 packages installed previously and 7
5.1.2
Software Installations and Updates
new packages pulled in by the updates. The software up-
dates used were not binary patches; as with an install,
For this experiment, we installed several packages into a
they included new versions of all the files in a software
capsule containing Debian GNU/Linux 3.0 and upgraded
package, a customary upgrade method for Debian and
several packages in a capsule containing Red Hat Linux
Red Hat systems.
7.2. Red Hat was chosen for the latter experiment be-
For reference, we also include the “null” data point
cause out-of-date packages were more readily available.
which corresponds to the size of the COW disk created
In each case, we booted up the capsule, logged in as root,
by simply logging in as root and shutting down the cap-
ran the Debian apt-get or Red Hat apt-rpm to download
sule without updating any software. This amounts to
and install a new package, configured the software, and
about 200 KB after HCP and gzip, consisting of i-nodes
saved the capsule as a child of the original one. We mi-
written due to updated access times, temporary files writ-
grated the child capsule to another machine that already
ten at boot, and so on.
had the parent cached. To reduce the COW disk size,
As shown in the figure, transfers of small installations
software packages were downloaded to a temporary disk
and updates are dominated by the installer rewriting
which we manually removed from the capsule after shut-
two 6 MB text databases of available software. Hash-
down.
ing sometimes saves us from having to send the entire
Figure 5 shows the difference in size between the trans-
database, but not always, due to insertions that change
fer of the new COW disk using HCP versus the size of
all the pages. The different results for make-dic and wv-
the software packages. Figure 5(a) shows installations
dial illustrate this effect. On larger installs, the cost of
of some well-known packages; the data point labeled
transferring the disk via HCP is near that of the orig-
“mega” corresponds to an installation of 492 packages,
inal package; the overhead of the installer database is
including the X Window System and TEX. Shown in Fig-
bounded by a constant and gzip does a good job of com-
ure 5(b) are updates to a number of previously installed
pressing the data. For larger updates, HCP sent less data
applications; the data point labeled “large” corresponds
than the packages because many of these updates con-

tained only minor changes from previous versions (such
(a) Data transferred, after gzip (MB)
120
as security patches and bug fixes), so that hashing found
execution
migration
100
similarities to older, already installed packages. In our
80
experiment, for updates over 10 MB, the savings amount
60
to about 40% in each case.
40
The results show that distributing COW disks via HCP
20
is a reasonable alternative to current software install and
0
nh h hp
nh h hp
nh h hp
nh h hp
nh h hp
nh h hp
nh h hp
nh h hp
update techniques. Package installations and upgrades
unballooned ballooned
unballooned ballooned
unballooned ballooned
unballooned ballooned
interactive apps on Windows
smttls on Linux
kernel compile on Linux
Apache on Linux
incur a relatively low fixed cost, so further benefits can
(b) Time for DSL tests (minutes)
be gained by batching smaller installs. In the case of up-
execution
60
migration
dates, HCP can exploit similarities between the new and
old packages to decrease the amount of data transferred.
40
The convenience of a less tedious and error-prone update
method is another advantage.
20
0
5.2
Migration of Active Capsules
nh h hp
nh h hp
nh h hp
nh h hp
nh h hp
nh h hp
nh h hp
nh h hp
unballooned ballooned
unballooned ballooned
unballooned ballooned
unballooned ballooned
interactive apps on Windows
smttls on Linux
kernel compile on Linux
Apache on Linux
To show how capsules support user mobility, we per-
(c) Time for LAN tests (minutes)
10
formed two sets of experiments, the first on a Windows
execution
migration
8
2000 capsule, the second on a Linux capsule.
6
First, we simulated the workload of a knowledge worker
with a set of GUI-intensive applications on the Windows
4
2000 operating system. We used Rational Visual Test
2
software to record user activities and generate scripts that
0
nh h hp
nh h hp
nh h hp
nh h hp
nh h hp
nh h hp
nh h hp
nh h hp
can be played back repeatedly under different test con-
unballooned ballooned
unballooned ballooned
unballooned ballooned
unballooned ballooned
interactive apps on Windows
smttls on Linux
kernel compile on Linux
Apache on Linux
ditions. We started a number of common applications,
including Microsoft Word, Excel, and PowerPoint, plus
Figure 6: Migration experiments. Data transferred for remote
Forte, a Java programming environment, loaded up some
activations and executions are shown after gzip in (a). Time to
large documents and Java sources, saved the active cap-
activate and run the experiments are shown for 384 kbps DSL
sule, migrated it to another machine, and proceeded to
in (b) and 100 Mbps switched Ethernet in (c). Labels “nh”,
“h”, and “hp” denote no hashing, hashing with an empty hash
use each of the four applications.
cache, and hashing with a primed cache, respectively.
On Linux, we tested migration with less-interactive and
more CPU- and I/O-bound jobs. We chose three applica-
tions: processor simulation with smttls, Linux kernel
compilations with GCC, and web serving with Apache.
data or time it took to execute the task once started.
We imagine that it would be useful to migrate large pro-
Figure 6(a) shows the amounts of data transferred over
cessor simulations to machines that might have become
the network during each migration and execution after
idle, or migrate fully configured webservers dynamically
gzip. These amounts are independent of the network
to adjust to current demand. For each experiment, we
speed assumed in the experiment. The memory image
performed a task, migrated the capsule, then repeated the
is transferred during the migration step, and disk data are
same task.
transferred on demand during the execution step. Gzip
To evaluate the contributions of each of our optimiza-
by itself compresses the 256 MB of memory data to 75–
tions, we ran each experiment twelve times. The exper-
115 MB. Except for the Windows interactive benchmark,
imental results are shown in Figure 6. We experimented
none of the applications incurs uncached disk accesses
with two network speeds, 384 kbps DSL and switched
during unballooned execution.
100 Mbps Ethernet. For each speed, we compared the
Hashing against an empty cache has little effect because
performance obtained with and without the use of bal-
it can only find similarities within the data being trans-
looning. We also varied the hashing scheme: the ex-
ferred. Our results show that either hashing against a
periments were run with no hashing, with hashing start-
primed disk or ballooning alone can greatly reduce the
ing from an empty hash cache, and with hashing starting
amount of memory data transferred to 10–45 MB. By
from a hash cache primed with the contents of the cap-
finding similarities between the old and new capsules,
sule disk. Each run has two measurements: “migration,”
primed hashing reduces the amount of data transferred
the data or time to start the capsule, and “execution,” the
both during migration and execution. While balloon-

ing reduces the amount of memory data that needs to
The processor simulation, kernel compile, and Apache
be transferred, it does so with the possible expense of
tasks take about 3, 4, and 1 minutes, respectively, to exe-
increasing the data transferred during the execution. Its
cute when running under VMware locally. Without bal-
effectiveness in reducing the total amount of data trans-
looning, these applications run mainly from memory, so
ferred is application-dependent; all but Apache, which
remote execution on either LAN or DSL is no slower
has a large working set, benefit tremendously from bal-
than local execution. Ballooning, however, can increase
looning.
Combining ballooning with primed hashing
run time, especially in the case of Apache.
generally results in the least amount of data transferred.
Our results show that active capsules can be migrated ef-
The timing results obtained on a 384 kbps DSL link,
ficiently to support user mobility. For users with high-
shown in Figure 6(b), follow the same pattern found in
speed connectivity, such as students living in a university
Figure 6(a). The execution takes longer proportionally
dormitory, memory images can be transferred without
because it involves computation and not just data trans-
ballooning or hashing in just a few minutes. For users
fer. With no optimization, it takes 29–44 minutes just to
with DSL links, there are two separate scenarios. In the
transfer the memory image over before the capsule can
case of a commute between work and home, it is likely
start executing. Hashing with priming reduces the start-
that an earlier capsule can be found at the destination,
up to less than 20 minutes in the Windows interactive
so that hashing can be used to migrate an unballooned
experiment, and less than 6 minutes in all the other appli-
memory image. However, to use a foreign capsule, bal-
cations. Ballooning also reduces the start-up time further
looning is helpful to reduce the start-up time of many
to 3–16 minutes. Again, combining both ballooning and
applications.
priming yields the best result in most cases. As the one
exception here, Apache demonstrates that ballooning ap-
5.3
Active Versus Inactive Capsules
plications with a large working set can slow them down
significantly.
The use of capsules makes it possible for a machine in a
Collective system to run an application without first hav-
Hashing is designed as an optimization for slow network
ing to install the application or even the operating sys-
connections; on a fast network, hashing can only slow
tem on which the application runs. It is also possible for
the transfer as a result of its computational overhead.
a user to continue the execution of an active capsule on
Figure 6(c) shows this effect. Hashing against a primed
a different machine without having first to boot up the
cache is even worse because of the additional verifica-
machine, log on, and run the application.
tion performed to ensure that the blocks on the destina-
tion machine match the hash. This figure shows that it
We ran experiments to compare these two modes of op-
takes only about 3 minutes to transfer an unballooned
eration. These experiments involved browsing a web-
image, and less than 2 minutes ballooned. Again, except
page local to the capsule using Mozilla running on Linux.
for Apache which experiences a slight slowdown, bal-
From the experiment results, we see that both active and
looning decreases both the start-up time as well as the
inactive capsules are useful in different scenarios and that
overall time.
using capsules is easier and takes less time than installing
the required software on the machine.
The Windows experiment has two parts, an interactive
part using Word, Excel, and PowerPoint on a number
The different scenarios we considered are:
of large documents, followed by compiling a source file
and building a Java archive (JAR) file in Forte. The for-
1. We mounted the inactive capsule file using NFS
mer takes a user about 5 minutes to complete and the
over the DSL link. We booted the inactive capsule,
latter takes about 45 seconds when running locally us-
ran Mozilla, and browsed a local webpage. The re-
ing VMware. In our test, Visual Test plays back the
sults for this test are shown in Figure 7 with the label
keystrokes and mouse clicks as quickly as possible. Over
NFS.
a LAN with primed hashing, the interactive part takes
2. We used demand paging to boot the inactive capsule
only 1.3 minutes to complete and Forte takes 1.8 min-
and ran Mozilla to browse the local webpage. We
utes. Over DSL with primed hashing, the interactive part
considered three alternatives: the machine had not
takes 4.4 minutes and Forte takes 7 minutes. On both
executed a similar capsule before and therefore had
the DSL and LAN, the user sees an adequate interactive
an empty hash cache, the machine had not executed
response. Forte is slower on DSL because it performs
the capsule before but the hash cache was primed
many small reads and writes. The reads are synchronous
with the disk state of the capsule, and the machine
and sensitive to the slow DSL link. Also, the first write to
had executed the same capsule before and hence the
a 16 KB COW block will incur a read of the block unless
capsule’s shadow disk had the required blocks lo-
the write fills the block, which is rarely the case.
cally cached. The results are shown in Figure 7 un-

der the labels boot, boot-p, and boot2 respectively.
5.4
Capsule Snapshots
3. We activated an active remote capsule that was al-
We simulate the migration of a user between work and
ready running a browser. We ran it with and without
home machines using a series of snapshots based on the
ballooning, and with and without priming the hash
Business Winstone 2001 benchmark. These simulation
cache with the inactive capsule disk. The results are
experiments show that migration can be achieved within
shown in the figure under the labels active, active-b,
a typical user’s commute time.
active-p, and active-bp.
The Winstone benchmark exercises ten popular applica-
tions: Access, Excel, FrontPage, PowerPoint, Word, Mi-
40
crosoft Project, Lotus Notes, WinZip, Norton AntiVirus,
30
execution
and Netscape Communicator. The benchmark replays
20
migration
user interaction as fast as possible, so the resulting user
10
session represents a time-compressed sequence of user
0
NFS
boot
boot-p
boot2
active
active-b active-p active-bp
input events, producing large amounts of stress on the
computer in a short time.
Figure 7: Times for activating a browser capsule (in minutes).
The capsules are NFS, booted with an NFS-mounted disk; boot,
To produce our Winstone snapshots, we ran one itera-
a remote capsule booted with demand paging and unprimed
tion of the Winstone test suite, taking complete images
database; boot2, the same remote capsule booted a second time;
of the machine state every minute during its execution.
and active, migration of a capsule with a running browser. Suf-
We took twelve snapshots, starting three minutes into the
fix “b” indicates that ballooning was done and suffix “p” indi-
cates that a primed database was used.
execution of the benchmark. Winstone spends roughly
the first three minutes of its execution copying the ap-
The bars in Figure 7 show the time taken while perform-
plication programs and data it plans to use and begins
ing the test. The times are split into execution and mi-
the actual workload only after this copying finishes. The
gration times. As expected, the four inactive capsules in
snapshots were taken after invoking the balloon process
the first two scenarios have negligible migration times,
to reduce the user’s memory state.
while execution times are negligible for the four active
To simulate the effect of a user using a machine alter-
capsules in the last scenario. When comparing the dif-
nately at work and home, we measured the transfer of
ferent scenarios we consider the total times as a sum of
snapshot to a machine that already held all the previ-
migration time and execution time.
ous snapshots. Figure 8 shows the amount of data trans-
In this experiment, demand paging, even with an empty
ferred for both the disk and memory images of snapshot
hash cache, performed much better than NFS. Demand
2 through 12. It includes the amount of data transferred
paging brought down the total time from about 42 min-
with and without hashing, and with and without gzip.
utes for NFS to about 21 minutes. When the cache was
The amount of data in the COW disk of each snapshot
warmed up with a similar capsule, the total time for the
varied depending on the amount of disk traffic that Win-
inactive capsule dropped to about 10 minutes. When the
stone generated during that snapshot execution.
The
inactive capsule was activated again with the required
large size of the snapshot 2 COW disk is due to Win-
blocks locally cached in the capsule’s shadow disk, it
stone copying a good deal of data at the beginning of
took only 1.8 minutes, comparable to boot of a local cap-
the benchmark. The size of the COW disks of all the
sule with VMware. Using an active capsule with no bal-
other snapshots range from 2 to 22 MB after gzip, and
looning or priming required about 12 minutes. Balloon-
can be transferred over completely under about 8 min-
ing the active capsule brought the time down to about 10
utes. Whereas hashing along with gzip compresses the
minutes, and priming the hash cache brought it down fur-
COW disks to about 10–30% of their raw size, it com-
ther to about 4 minutes, comparable to the time taken to
presses the memory images to about 2–6% of their raw
boot a local machine and bring up Mozilla. These times
size. The latter reduction is due to the effect of the bal-
are much less than the time taken to install the required
looning process writing zero pages in memory. The sizes
software on the machine.
of ballooned and compressed memory images are fairly
These results suggest that: (a) if a user has previously
constant across all the snapshots. The memory images
used the inactive capsule, then the user should boot that
require a transfer of only 6–17 MB of data, which takes
capsule up and use it, (b) otherwise, if the user has pre-
no more than about 6 minutes on a DSL link. The results
viously used a similar capsule, the user should use an ac-
suggest that the time needed to transfer a new memory
tive capsule, and (c) otherwise, if executing the capsule
image, and even the capsule disk in most cases, is well
for the first time, the user should use an active ballooned
within a typical user’s commute time.
capsule.

(a) COW disks data transferred (MB)
Other work has looked at migration and checkpoint-
500
ing at process and object granularities.
Systems
working at process level include V[26], Condor[16],
400
libckpt[21], and CoCheck[25]. Object-level systems in-
raw
300
gzip
clude Legion[10], Emerald[14], and Rover[13].
hash
200
hash+gzip
LBFS[19] provided inspiration for HCP and the hash
cache. Whereas LBFS splits blocks based on a finger-
100
print function, HCP hashes page-aligned pages to im-
0
prove performance on memory and disk images. Man-
2
3
4
5
6
7
8
9
10
11
12
ber’s SIF[17] uses content-based fingerprinting of files
(b) Memory images data transferred (MB)
to summarize and identify similar files.
40
7
Conclusions
30
gzip
In this paper, we have shown a system that moves a com-
hash
20
hash+gzip
puter’s state over a slow DSL link in minutes rather than
hours. On a 384kbps DSL link, capsules in our experi-
10
ments move in at most 20 minutes and often much less.
0
We examined four optimization techniques. By using
2
3
4
5
6
7
8
9
10
11
12
copy-on-write (COW) disks to capture the updates to
disks, the amount of state transferred to update a cap-
Figure 8: Snapshots from the Winstone benchmark showing (a)
sule is proportional to the modifications made in the cap-
disks and (b) memory images transferred. Raw sizes not shown
in (b) are constant at about 256 MB.
sule. Although COW disks created by installing soft-
ware can be large, they are not much larger than the in-
staller and more convenient for managing large numbers
6
Related Work
of machines. Demand paging fetches only the portion
of the capsule disk requested by the user’s tasks. “Bal-
Much work was done in the 1970s on virtual machines
looning” removes non-essential data from the memory,
at the hardware level[9] and interest has recently revived
thus decreasing the time to transfer the memory image.
with the Disco[4] and Denali[30] projects and VMware
Together with demand paging, ballooning leads to fast
GSX Server[28]. Earlier work demonstrated the isola-
loading of new capsules. Hashing exploits similarities
tion, performance, and economic properties of virtual
between related capsules to speed up the data transfer
machines. Chen and Noble suggested using hardware-
on slow networks. Hashing is especially useful for com-
level virtual machines for user mobility[6]. Kozuch and
pressing memory images on user commutes and disk im-
Satyanarayanan independently came up with the idea of
ages on software updates.
using VMware’s x86 VMMs to achieve mobility[15].
Hopefully, future systems can take advantage of our tech-
Others have also looked at distributing disk images for
niques for fast capsule migration to make computers eas-
managing large groups of machines. Work by Rauch et
ier to use and maintain.
al. on partition repositories explored maintaining clusters
of machines by distributing raw disk images from a cen-
8
Acknowledgments
tral repository[22]. Rauch’s focus is on reducing the size
of the repository; ours is on reducing time spent send-
This research is supported in part by National Sci-
ing disk images over a WAN. Their system, like ours,
ence Foundation under Grant No. 0121481 and Stanford
reduces the size of successive images by storing only the
Graduate Fellowships. We thank Charles Orgish for dis-
differences between revisions. They also use hashes to
cussions on system management, James Norris for work-
detect duplicate blocks and store only one copy of each
ing on an earlier prototype, our shepherd Jay Lepreau,
block. Emulab[31], Cluster-on-Demand[18], and others,
Mike Hilber, and David Brumley for helpful comments.
are also distributing disk images to help maintain groups
of computers.
References
The term capsule was introduced earlier by one of the
[1] FIPS 180-1. Announcement of weakness in the se-
authors and Schmidt[24]. In that work, capsules were
cure hash standard. Technical report, National In-
implemented in the Solaris operating system and only
stitute of Standards and Technology (NIST), April
groups of Solaris processes could be migrated.
1994.

[2] A. C. Arpaci-Dusseau and R. H. Arpaci-Dusseau.
[18] J. Moore and J. Chase. Cluster on demand. Tech-
Information and control in gray-box systems. In
nical report, Duke University, May 2002.
Proceedings of the 18th ACM Symposium on Oper-
[19] A. Muthitacharoen, B. Chen, and D. Mazi`eres. A
ating Systems Principles (SOSP ’01), pages 43–59,
low-bandwidth network file system. In Proceed-
October 2001.
ings of the 18th ACM Symposium on Operating Sys-
[3] A. A. Awadallah and M. Rosenblum. The vMa-
tems Principles (SOSP ’01), pages 174–187, Octo-
trix: A network of virtual machine monitors for
ber 2001.
dynamic content distribution. In Seventh Interna-
[20] M. Muuss. The story of T-TCP.
tional Workshop on Web Content Caching and Dis-
[21] J. Plank, M. Beck, G. Kingsley, and K. Li. Libckpt:
tribution, August 2002.
Transparent checkpointing under Unix. In Proceed-
[4] E. Bugnion, S. Devine, and M. Rosenblum. Disco:
ings of the USENIX Winter 1995 Technical Confer-
Running commodity operating systems on scalable
ence, pages 213–224, January 1995.
multiprocessors. ACM Transactions on Computer
[22] F. Rauch, C. Kurmann, and T. Stricker. Partition
Systems, 15(4):412–447, November 1997.
repositories for partition cloning—OS independent
[5] F. Chabaud and A. Joux. Differential collisions in
software maintenance in large clusters of PCs. In
SHA-0. In Proceedings of CRYPTO ’98, 18th An-
Proceedings of the IEEE International Conference
nual International Cryptology Conference, pages
on Cluster Computing 2000, pages 233–242, 2000.
56–71, August 1998.
[23] L. Rizzo. Dummynet: a simple approach to the
[6] P. M. Chen and B. D. Noble. When virtual is better
evaluation of network protocols. ACM Computer
than real. In Proceedings of the 8th IEEE Workshop
Communication Review, 27(1):31–41, Jan. 1997.
on Hot Topics on Operating Systems, May 2001.
[24] B. K. Schmidt. Supporting Ubiquitous Computing
[7] http://www.connectix.com/.
with Stateless Consoles and Computation Caches.
[8] P. Deutsch. Zlib compressed data format specifica-
PhD thesis, Computer Science Department, Stan-
tion version 3.3, May 1996.
ford University, August 2000.
[9] R. P. Goldberg. Survey of virtual machine research.
[25] G. Stellner. CoCheck: Checkpointing and process
Computer, 7(6):34–45, June 1974.
migration for MPI. In Proceedings of the 10th In-
[10] A.
Grimshaw,
A.
Ferrari,
F.
Knabe,
and
ternational Parallel Processing Symposium, pages
M. Humphrey.
Legion:
An operating system
526–531, April 1996.
for wide-area computing.
Technical Report CS-
[26] M. M. Theimer, K. A. Lantz, and D. R. Cheriton.
99-12, Dept. of Computer Science, University of
Preemptable remote execution facilities for the V-
Virginia, March 1999.
system. In Proc. 10th Symposium on Operating
[11] IA-32
Intel
architecture
software
Systems Principles, pages 10–12, December 1985.
developer’s
manual
volumes
1-3.
[27] A. Tridgell. Efficient Algorithms for Sorting and
http://developer.intel.com/design/pentium4/manuals/.
Synchronization. PhD thesis, Australian National
[12] IBM Virtual Machine/370 Planning Guide. IBM
University, April 2000.
Corporation, 1972.
[28] “GSX
server”,
white
paper.
[13] A. Joseph, J. Tauber, and M. Kaashoek. Mobile
http://www.vmware.com/pdf/gsx whitepaper.pdf.
computing with the Rover toolkit. IEEE Transac-
[29] C. A. Waldspurger. Memory resource management
tions on Computers, 46(3):337–352, March 1997.
in VMware ESX server. In Proceedings of the Fifth
[14] E. Jul, H. Levy, N. Hutchinson, and A. Black. Fine-
Symposium on Operating Systems Design and Im-
grained mobility in the Emerald system.
ACM
plementation, December 2002.
Transaction on Computer Systems, 6(1):109–133,
[30] A. Whitaker, M. Shaw, and S. Gribble.
Denali:
February 1988.
Lightweight virtual machines for distributed and
[15] M. Kozuch and M. Satyanarayanan. Internet sus-
networked applications. Technical report, Univer-
pend/resume.
In Proceedings of the Workshop
sity of Washington, February 2001.
on Mobile Computing Systems and Applications,
[31] B. White et al. An integrated experimental environ-
pages 40–46, June 2002.
ment for distributed systems and networks. In Pro-
[16] M. Litzkow, M. Livny, and M. Mutka. Condor – a
ceedings of the Fifth Symposium on Operating Sys-
hunter of idle workstations. In Proceedings of the
tems Design and Implementation, December 2002.
8th International Conference on Distributed Com-
[32] Wintel
architecture
specifications.
puting Systems, pages 104–111, June 1988.
http://www.microsoft.com/hwdev/specs/.
[17] U. Manber. Finding similar files in a large file sys-
tem. In Proceedings of the USENIX Winter 1994
Technical Conference
, pages 1–10, 17–21 1994.

Document Outline