Lock-stepped Java Virtual Machines

If you are writing a program that is supposed to run for a long time (i.e. several years), it would be nice to be able to have it survive failures of computer hardware, system upgrades and natural disasters without having to design that into the protocols of the system. Programmers are now often writing to one of three virtual machines (Java's VM; Microsoft's .NET CLR; or for those writing in Perl, Python, PHP, Ruby or many others... the byte-codes of Parrot once it is finished). Since performance is obviously not a great issue, why not use the extra compute power of modern hardware to keep several computers running the program in lock-step?

The idea for this comes from the space-shuttle's computers, where banks of five computers perform every computation and then confirm with each other that they all reached the same result. I've thought about it for Java because it's the most prevalent and probably the easiest to implement, but I don't think there is anything in this discussion which would be fundamentally different for the others.

I'm envisaging that you would have a N computers running M virtual machines. When you add a new computer to the mix, you would tell it which virtual machines out of the M you want it to mirror. Then you could on-the-fly add and subtract virtual machines and add and subtract mirrors of them.

Start up is not too difficult. You connect to another instance of a virtual machine lock-stepper and tell it you would like to duplicate it. (Obviously, you have some kind of authentication system such as a password to make sure you're allowed to do this.) So you have A (the existing VM) and B (your new one trying to catch up). A then sends a block of memory to B, and turns on alert-on-write for that block of memory. Once B has confirmed receiving the block (with a checksum to know that it arrived correctly), A will send the next block of memory, and turn on alert-on-write. And so on. Meanwhile, A is doing more processing; whenever it writes to a block of memory, it checks the alert-on-write. If it was on, it marks some part of the block of memory "dirty". Once B has finished walking through all of A's memory, A will then cycle through all the dirty pages (marking them clean and sending them). The cycle keeps repeating, hopefully with fewer and fewer dirty pages each time, until finally B catches up and can start processing (A will send the program counter information so that B knows where to start).

Run-time only has a few difficulties. It's probably worth avoiding funny corner cases (e.g. corrupt memory and a sneakily faulty CPU) by stopping every few thousand instructions and comparing a checksum of a random page of memory between all the running virtual machines.

The only difficult things are:

Out-of-memory conditions

The two Java virtual machines must be given allocations of the same amount of memory initially, and they have to use the same allocation and garbage collection policy. So if somebody tries to allocate (say) an array of a million Objects, it will either work on both virtual machines, or it will throw a MemoryError on both.
I think this is the only reason that you actually need to do any work at the VM layer (i.e. you can just do this by rewriting libraries).
So you would probably need to specify at start-up time how much memory was available to the virtual machine. But a lock-stepped environment is probably going to take a dim view of shutting this important program down to allocate more memory, so you need some way of adding an extra memory pool. But you need to be able to ask "can all the computers running this VM now cope with such an addition? If so, do it." Remember that the one computer could be running several VMs; I'm envisaging they would all live in the same address space (after all, why not?) Unlike C or C++ which defaults to having only one global pool for malloc/new, Ada9X has a nice idea of X'Memory_Pool which would make the particular implementation detail easy; but I don't know how you would de-allocate a memory pool (in either language).

Disk I/O and filesystem I/O

Haven't finished thinking about this one. Maybe you run unison over a directory first. Or maybe you use a replicated filesystem (the ADFS from Tru64? Or next generation Reiser?). Or ignore the problem, and assume you have a reliable disk storage system (e.g. replicated EMC or Hitachi disk array). The challenge here again is making sure that if one system is going to run out of disk space, all are. At the same time, you want to be able to add and remove disk space on the fly to keep the system running happily.

TCP/IP and UDP/IP

The virtual machine would have it's own IP address, allocated from a globally-routable subnet that is used exclusively for lockstep VMs. Obviously IPv6 would help here so that you don't run out of addressing space! This IP address would need to be receiving data on two different computers at the same time -- e.g. we receive the SYN on computer A, reply back with an ACK, and then get data coming to computer B on that port number! Originally I thought I would need to do some low-level operating system hacking to get this to work (hack the TCP layer to allow synthetic-SYNs), but then I realised that you just have to re-implement a PPP daemon in user-space. The virtual machine allocates a TTY and starts its own PPP fakery at one end, and the operating system's ordinary PPP daemon at the other end. These PPP links (one on each computer running a virtual machine) transfer packets into the globally-routable subnet. We will need to run RIP or OSPF as well around the network so that other computers know how to get to this subnet -- if you are near computer A and it is working, send it to A's ethernet address, where A's ordinary IP stack will pass it through the pppd, through the tty and into the VM. See the section on user input and other I/O for what happens next.

Swing/graphical user interfaces/text output

This one had me stumped for a while... in the end I realised you just need to implement a VNC server. Then the lock-stepped IP stack from the paragraph above does all the work for you.

User input and other I/O

This only gets a little tricky when you have two computers running at different speeds. Suppose computer A has just received a blob of data; it says to B "blob of data 213 arrived when I was at instruction 234556" (note: should be a 128 bit integer), and then freezes all processing.

B could have already been long past instruction 234556, so it might reply "let blob of data 213 arrive at instruction 234778" (in which case the tables have turned, B freezes, A unfreezes and decides what to do next)
B could be on 234556, so it replies "acknowledge on data 213". A then sends the blob of data.
B could be before 234556; so it records the real-world time that A told it about the data, and then processes as fast as it can to catch up. Once it finally reaches instruction 234556, it records the real-world time again and stores that off for performance and tuning records. Then it replies "acknowledge on 213".

Obviously, tweaking this algorithm is going to be the big deciding factor in the speed of the whole system -- I/O is what most systems bottleneck on, and the constant toing-and-froing is going to make this bottleneck much worse.

I can't come up with any brilliant business models to support developing this open source (and if not, then why bother?). My best thought is to run some instance mirrors on behalf of clients (i.e. the client runs one copy or more copies in house, we run another copy in case there is a disaster (e.g. power failure) at their site. Unfortunately, I don't think this will be particularly cost-effective. Nor would the availability be that good; we would be probably unlikely to even get to 99.999% availability (which is often called high-availability, rather than fault-tolerance).