I have a process running on one computer that spawns simulations by writing the simulation data to the directory pre/id. Worker processes then copy a simulation from pre to a local disk, which can be on a different computer. pre is in a volume mounted over NFS. This part works well.
When a simulation is done, the results are moved to the directory result/id, and this is the step that causes trouble. The supervising process can decide to keep such a directory or to delete it. Occasionally, when it tries to delete result/id, the move operation seems to be incomplete, and removing the directory fails.
Everything runs on a variety of Linux flavors. The workers move directories around using mv and then touch result/id/done to signal to the supervising process that the result can be read (and deleted). The supervising process uses boost::filesystem::remove_all to delete result/id.
How can I reliably wait for the move operation to complete before attempting to delete the directory?
Added: This code moves the result directory to where the supervising process waits for it:
mv "$tempDir" "$finishedCasesDir" # move case into the result directory
touch "$finishedCasesDir/$caseName/done"
This is the C++ code that waits for done to appear:
if(is_regular_file(resultPath/"done"))
{
  // get relevant result data
  ...
  // remove result directory
  remove_all(resultPath);
}
And the error:
terminate called after throwing an instance of 'boost::filesystem3::filesystem_error'
what():  boost::filesystem::remove: Directory not empty: "results/711a35ed-818e-4084-ab43-47531fdd8d11"
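The "terminate called" line means the filesystem_error escaped uncaught. As a rough sketch, the delete could use a non-throwing overload and record what is still in the directory when removal fails. This uses std::filesystem (C++17) as a stand-in for boost::filesystem, which has a similar remove_all overload taking an error code. RemoveReport and try_remove_result are hypothetical names, not part of my actual code.

```cpp
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical result type: whether removal succeeded, the error text,
// and the entries still present in the directory if it did not.
struct RemoveReport {
    bool removed;
    std::string error;
    std::vector<fs::path> leftovers;
};

// Try to delete `dir` recursively without throwing.
RemoveReport try_remove_result(const fs::path& dir) {
    RemoveReport report{true, "", {}};
    std::error_code ec;
    fs::remove_all(dir, ec);  // non-throwing overload: failure is reported via ec
    if (!ec) {
        return report;  // directory is gone (or never existed)
    }
    report.removed = false;
    report.error = ec.message();

    // Collect whatever is still inside, so it can be logged and inspected.
    std::error_code iter_ec;
    fs::recursive_directory_iterator it(dir, iter_ec), end;
    while (!iter_ec && it != end) {
        report.leftovers.push_back(it->path());
        it.increment(iter_ec);
    }
    return report;
}
```

Calling try_remove_result(resultPath) in place of remove_all(resultPath) would let the supervising process log report.error and report.leftovers instead of terminating.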
[...] result/id/done to signal that the move is complete. If done correctly, it should be enough for the supervising process to check for this file's existence. So why isn't this enough? – alexis Feb 13 '14 at 11:56
[...] done does not exist in the directory prior to being moved? – alexis Feb 13 '14 at 12:13
[...] done. When a simulation is run, a possibly existing done file is removed before the simulation starts. – Christoph Feb 13 '14 at 12:17
[...] remove_all removes recursively. Also the code only fails occasionally - I can delete my result directories 15k times without any problems, and then it suddenly fails. That's why I concluded that there's something still being written into the directory by mv. – Christoph Feb 13 '14 at 13:17
[...] done after mv exits, it is not the source of the problem. Either a different process sometimes writes in the same directory (could you have id collisions?), or remove_all can fail to find and remove all files before removing the directory. What's left in the directory when you encounter a failure? – alexis Feb 13 '14 at 13:39
[...] touch done that works over nfs, or any other way of signalling to another process on another machine that it can harvest a result. – Christoph Feb 13 '14 at 13:58
[...] remove_all in a loop now until the error disappears - a bit brutal but it should have the desired effect without causing any harm. If I leave data in the directory, that could quickly add up to a few TB. Not good. – Christoph Feb 13 '14 at 14:40
[...] touch is not atomic?" I don't buy this scenario. On Unix, holding an open file descriptor does not lock the directory entry; you can unlink away. The inode and blocks won't be freed until the descriptor is closed, but that doesn't get in the way of unlinking the directory. – alexis Feb 14 '14 at 00:05
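For reference, the retry workaround mentioned in the comments (calling remove_all in a loop until the error disappears) might look roughly like the sketch below. It is bounded rather than endless, and it uses std::filesystem (C++17) in place of boost::filesystem. The function name remove_all_with_retry, the attempt count, and the delay are illustrative assumptions, not values tuned for NFS, and the loop only works around the symptom without explaining it.

```cpp
#include <chrono>
#include <filesystem>
#include <system_error>
#include <thread>

namespace fs = std::filesystem;

// Retry recursive removal a bounded number of times.
// Returns true once remove_all reports no error, false if every attempt failed.
bool remove_all_with_retry(const fs::path& dir,
                           int max_attempts = 5,
                           std::chrono::milliseconds delay = std::chrono::milliseconds(200)) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        std::error_code ec;
        fs::remove_all(dir, ec);  // non-throwing overload
        if (!ec) {
            return true;  // directory is gone (or never existed)
        }
        if (attempt < max_attempts) {
            std::this_thread::sleep_for(delay);  // give pending writes time to settle
        }
    }
    return false;  // caller decides whether to log, skip, or escalate
}
```

Capping the attempts means a directory that can never be removed does not stall the supervising process; the caller can log the failure and move on.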