3

The job of my Unix executable file is to perform a long computation, and I added an interrupt/resume functionality to it as explained below.

At regular intervals, the program writes all relevant data found so far in a checkpoint file, which can then be used as a starting point for a "resume" operation.

To interrupt the program, I use Ctrl+C.
The only problem with this methodology is that, if the interruption occurs while the program is writing to the file, I am left with a useless, half-written file.

The only fix I could find so far is as follows:

  • make the program write into two files, so that at restart time one of them will be readable.

Is there a cleaner, better way to create an "interruptable" Unix executable?

James Youngman
  • 1,144
  • 5
  • 20
  • 2
    The thing you're asking about is called "checkpointing" so adding that word somewhere should allow future users to identify your question as relevant more easily. – James Youngman Oct 17 '16 at 08:28
  • @JamesYoungman Unfortunately, "checkpointing" is not a tag here, and I do not have enough ratings to create a new tag – Ewan Delanoy Oct 17 '16 at 08:33
  • 1
    Your main concern: “when the program is writing into the file, I am left with a useless half written file” can be addressed by writing a temporary file and then atomically replacing the target file with it. This ensures data at the expected location is always consistent. – phg Oct 17 '16 at 08:52
  • @phg I must confess that I do not really know what the "atomic" operations are in Unix. Could you clarify what you mean by "atomically replacing" ? – Ewan Delanoy Oct 17 '16 at 11:27
  • https://rcrowley.org/2010/01/06/things-unix-can-do-atomically.html – phg Oct 17 '16 at 11:50
  • You could "hold" (defer) the interrupt (Ctrl+C) signal while you're writing the checkpoint file. – G-Man Says 'Reinstate Monica' Oct 21 '16 at 00:43

2 Answers

5

It depends a bit on whether you care only about the program itself crashing, or about the whole system crashing.

In the first case, you could write the fresh data to a new file, and then rename that to the real name only after you're done writing. That way the file will contain either the previous or the new checkpoint data, but never only partial information. That said, partial writes should be rare in any case, assuming the checkpointing code itself is not likely to fail and the relevant signals are trapped to make sure the program saves a new checkpoint in full before exiting. (In addition to SIGINT, I think you'd better catch SIGHUP and SIGTERM too.)
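A minimal C sketch of that write-then-rename pattern; the path handling and data format are illustrative, not from the question:

```c
#include <stdio.h>

/* Write the new checkpoint to "<path>.tmp", then let rename(2) replace
 * the old file in one step, so a reader sees either the previous or the
 * new checkpoint, never a partial one. */
int save_checkpoint(const char *path, const char *data)
{
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    FILE *f = fopen(tmp, "w");
    if (f == NULL)
        return -1;
    if (fputs(data, f) == EOF) {
        fclose(f);
        remove(tmp);
        return -1;
    }
    if (fclose(f) == EOF) {
        remove(tmp);
        return -1;
    }
    /* On POSIX filesystems the replacement is atomic. */
    return rename(tmp, path);
}
```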

If we consider the possibility of the whole system crashing, then I wouldn't trust only one checkpoint file. The data is not likely to actually be on the disk when the write system call returns. Instead, the OS and the disk itself are likely to cache the data and actually write it some time later. So keeping one or two previous checkpoints around would work as a failsafe against that.
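For the whole-system-crash case, one commonly suggested hardening is to fsync the temporary file before the rename and then fsync the containing directory, so the rename itself reaches the disk. A hedged sketch, with error handling abbreviated and all paths illustrative:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Durable variant: fsync the temp file before rename(2), then fsync the
 * directory so the directory entry for the rename is on disk too. */
int durable_checkpoint(const char *dir, const char *name, const char *data)
{
    char tmp[4096], final_path[4096];
    snprintf(tmp, sizeof tmp, "%s/%s.tmp", dir, name);
    snprintf(final_path, sizeof final_path, "%s/%s", dir, name);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, strlen(data)) < 0 || fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    close(fd);

    if (rename(tmp, final_path) < 0)
        return -1;

    int dfd = open(dir, O_RDONLY);   /* sync the directory entry too */
    if (dfd < 0)
        return -1;
    int r = fsync(dfd);
    close(dfd);
    return r;
}
```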

ilkkachu
  • 138,973
  • The use of more than one file is a good idea. Whether the extra burden of writing and maintaining garbage collection is worthwhile depends on the value of the computation saved and the frequency of file-system-corrupting system crashes. – James Youngman Oct 17 '16 at 08:31
  • Do you believe that sync and fsync aren't good enough to get the data written to the disk? – G-Man Says 'Reinstate Monica' Oct 20 '16 at 05:51
  • @G-Man, well for one, I don't trust all drives to actually write even though they might have promised. At least without battery-backup for the cache (or the whole drive). Second, after the O_PONIES fiasco, I'm a bit disillusioned about crash-safety. But yeah, if you have battery backup, trust your OS and remember to fsync both the new file and the containing directory after renaming a new file into place, then yeah, you might be safe. – ilkkachu Oct 20 '16 at 09:54
  • 1
    In any case, since you need to keep the previous checkpoint in place while writing the current one (to have at least one copy remaining at all times), you might as well leave the previous file there until you're ready to start writing the next one. – ilkkachu Oct 20 '16 at 10:00
4

You can catch the SIGINT signal that is sent to the process when Ctrl+C is pressed by installing a signal handler. Then the process isn't killed immediately; instead, the signal handler is called. In the signal handler you can then write the results to a file. This is the general idea; in practice you may have some finer details to take care of.

Johan Myréen
  • 13,168
  • 3
    You shouldn't do I/O in a signal handler. In general you shouldn't do anything beyond setting a flag. – user207421 Oct 16 '16 at 23:33
    EJP is right, you should only, for example, set a global variable exiting to true. This flag should then be checked at regular intervals, and if it is true, the appropriate action should be taken to save the state to a disk file. The challenge here is that the program should check the flag at sufficiently regular intervals, so that you don't miss the boat if the system is being shut down. – Johan Myréen Oct 21 '16 at 08:11