Back to the Ideas Bakery
Lock-stepped Java Virtual Machines
If you are writing a program that is supposed to run for a long
time (i.e. several years), it would be nice to be able to have it
survive failures of computer hardware, system upgrades and natural
disasters without having to design that into the protocols of the
system. Programmers are now often writing to one of three virtual
machines (Java's VM; Microsoft's .NET CLR; or for those writing in
Perl, Python, PHP, Ruby or many others... the byte-codes of Parrot
once it is finished). Since performance is obviously not a great issue,
why not use the extra compute power of modern hardware to keep several
computers running the program in lock-step?
The idea for this comes from the space-shuttle's computers, where banks
of five computers perform every computation and then confirm with each
other that they all reached the same result. I've thought about it for
Java because it's the most prevalent and probably the easiest to implement, but
I don't think there is anything in this discussion which would be fundamentally
different for the others.
I'm envisaging that you would have a N computers running M virtual machines.
When you add a new computer to the mix, you would tell it which virtual machines
out of the M you want it to mirror. Then you could on-the-fly add and subtract
virtual machines and add and subtract mirrors of them.
Start up is not too difficult. You connect to another instance of a virtual
machine lock-stepper and tell it you would like to duplicate it.
(Obviously, you have some kind of authentication system such as a password to
make sure you're allowed to do this.) So you have A (the existing VM) and B (your
new one trying to catch up). A then sends a block of memory to B, and turns on
alert-on-write for that block of memory. Once B has confirmed receiving the block
(with a checksum to know that it arrived correctly), A will send the next block
of memory, and turn on alert-on-write. And so on. Meanwhile, A is doing more
processing; whenever it writes to a block of memory, it checks the alert-on-write.
If it was on, it marks some part of the block of memory "dirty". Once B has finished walking
through all of A's memory, A will then cycle through all the dirty pages (marking
them clean and sending them). The cycle keeps repeating, hopefully with fewer and
fewer dirty pages each time, until finally B catches up and can start processing (A
will send the program counter information so that B knows where to start).
Run-time only has a few difficulties. It's probably worth avoiding funny corner
cases (e.g. corrupt memory and a sneakily faulty CPU) by stopping every few thousand
instructions and comparing a checksum of a random page of memory between all the
running virtual machines.
The only difficult things are:
- Out-of-memory conditions
- The two Java virtual machines must be given allocations of the same
amount of memory initially, and they have to use the same allocation and
garbage collection policy. So if somebody tries to allocate (say) an array
of a million Objects, it will either work on both virtual machines, or it
will throw a MemoryError on both.
I think this is the only reason that you actually need to do any work at
the VM layer (i.e. you can just do this by rewriting libraries).
So you would probably need to specify at start-up time how much memory was
available to the virtual machine. But a lock-stepped environment is probably
going to take a dim view of shutting this important program down to allocate
more memory, so you need some way of adding an extra memory pool. But you
need to be able to ask "can all the computers running this VM now cope with
such an addition? If so, do it." Remember that the one computer could be running
several VMs; I'm envisaging they would all live in the same address space (after all,
why not?) Unlike C or C++
which defaults to having only one global pool for malloc/new, Ada9X has
a nice idea of X'Memory_Pool which would make the particular implementation detail
easy; but I don't know how you would de-allocate a memory pool (in either language).
- Disk I/O and filesystem I/O
- Haven't finished thinking about this one. Maybe you run unison over a directory
first. Or maybe you use a replicated filesystem (the ADFS from Tru64? Or next generation
Reiser?). Or ignore the problem, and assume you have a reliable disk storage system (e.g.
replicated EMC or Hitachi disk array).
The challenge here again is making sure that if one system is going to run
out of disk space, all are. At the same time, you want to be able to add and remove
disk space on the fly to keep the system running happily.
- TCP/IP and UDP/IP
- The virtual machine would have it's own IP address, allocated from
a globally-routable subnet that is used exclusively for lockstep VMs. Obviously
IPv6 would help here so that you don't run out of addressing space!
This IP address would need to be receiving data
on two different computers at the same time -- e.g. we receive the SYN on computer A,
reply back with an ACK, and then get data coming to computer B on that port number!
Originally I thought I would need to do some low-level operating system hacking to
get this to work (hack the TCP layer to allow synthetic-SYNs), but then I realised
that you just have to re-implement a PPP daemon in user-space. The virtual machine
allocates a TTY and starts its own PPP fakery at one end, and the operating system's
ordinary PPP daemon at the other end. These PPP links (one on each computer running
a virtual machine) transfer packets into the globally-routable subnet.
We will need to run RIP or OSPF as well around
the network so that other computers know how to get to this subnet -- if you are
near computer A and it is working, send it to A's ethernet address, where A's ordinary
IP stack will pass it through the pppd, through the tty and into the VM. See the
section on user input and other I/O for what happens next.
- Swing/graphical user interfaces/text output
- This one had me stumped for a while... in the end I realised you just need to
implement a VNC server. Then the lock-stepped IP stack from the paragraph above
does all the work for you.
- User input and other I/O
- This only gets a little tricky when you have two computers running at different
speeds. Suppose computer A has just received a blob of data; it says to B "blob of data 213
arrived when I was at instruction 234556" (note: should be a 128 bit integer), and then
freezes all processing.
- B could
have already been long past instruction 234556, so it might reply "let blob of data 213 arrive
at instruction 234778" (in which case the tables have turned, B freezes, A unfreezes and decides
what to do next)
- B could be on 234556, so it replies "acknowledge on data 213". A then sends the blob of
data.
- B could be before 234556; so it records the real-world time that A told it about the
data, and then processes as fast as it can to catch up. Once it finally reaches instruction
234556, it records the real-world time again and stores that off for performance and tuning
records. Then it replies "acknowledge on 213".
Obviously, tweaking this algorithm is going to be the big deciding factor in the speed of
the whole system -- I/O is what most systems bottleneck on, and the constant toing-and-froing is
going to make this bottleneck much worse.
I can't come up with any brilliant business models to support developing
this open source (and if not, then why bother?).
My best thought is to run some instance mirrors on behalf of clients
(i.e. the client runs one copy or more copies in house, we run another
copy in case there is a disaster (e.g. power failure) at their site.
Unfortunately, I don't think this will be particularly cost-effective. Nor
would the availability be that good; we would be probably unlikely to even
get to 99.999% availability (which is often called high-availability, rather
than fault-tolerance).