I have a process running on one computer that spawns simulations by writing the simulation data to directory pre/id. Worker processes then copy a simulation from pre to a local disk, which can be on a different computer. pre is in a volume mounted with nfs. This part works well.
When a simulation is done, the results are moved to the directory result/id, which is what is causing trouble. The supervising process can decide to keep such a directory or to delete it. Occasionally, when it tries to delete result/id, the move operation seems to be incomplete, and removing the directory fails.
Everything runs on a variety of Linux flavors. The workers move directories around using mv and then touch result/id/done to signal to the supervising process that the result can be read (and deleted). The supervising process uses boost::filesystem::remove_all to delete result/id.
How can I reliably wait for the move operation to be completed, before attempting to delete it?
Added: This code moves the result directory to where the supervising process waits for it:
mv $tempDir $finishedCasesDir # copy case to result directory
touch $finishedCasesDir/$caseName/done
This is the C++ code that waits for done to appear:
if(is_regular_file(resultPath/"done"))
{
    // get relevant result data
    ...
    // remove result directory
    remove_all(resultPath);
}
And the error:
terminate called after throwing an instance of 'boost::filesystem3::filesystem_error'
what(): boost::filesystem::remove: Directory not empty: "results/711a35ed-818e-4084-ab43-47531fdd8d11"
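To see what is actually left in the directory when this error occurs, the removal could be wrapped so that it reports the remaining entries instead of terminating. The following is only a minimal sketch. It assumes C++17 std::filesystem as a stand-in for boost::filesystem (the Boost calls should be similar, but that mapping is an assumption), and list_leftovers and try_remove_and_report are hypothetical helper names that are not part of the original code.

```cpp
#include <filesystem>
#include <iostream>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: collect every path still present below dir.
// Returns an empty list if dir does not exist or cannot be read.
std::vector<std::string> list_leftovers(const fs::path& dir)
{
    std::vector<std::string> leftovers;
    std::error_code ec;
    fs::recursive_directory_iterator it(dir, ec), end;
    while (!ec && it != end) {
        leftovers.push_back(it->path().string());
        it.increment(ec);  // non-throwing step; stops the loop on error
    }
    return leftovers;
}

// Hypothetical helper: try to delete dir. On failure, print the error and
// whatever is still inside, instead of letting an exception escape.
bool try_remove_and_report(const fs::path& dir)
{
    std::error_code ec;
    fs::remove_all(dir, ec);  // non-throwing overload
    if (!ec)
        return true;

    std::cerr << "remove_all failed for " << dir << ": " << ec.message() << '\n';
    for (const auto& entry : list_leftovers(dir))
        std::cerr << "  left behind: " << entry << '\n';
    return false;
}
```

Calling try_remove_and_report(resultPath) in place of remove_all(resultPath) would log the names of any files that appear after done has been seen, which should help narrow down which process is still writing.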
[...] result/id/done to signal that the move is complete. If done correctly, it should be enough for the supervising process to check for this file's existence. So why isn't this enough? – alexis Feb 13 '14 at 11:56

[...] done does not exist in the directory prior to being moved? – alexis Feb 13 '14 at 12:13

[...] done. When a simulation is run, a possibly existing done file is removed before the simulation starts. – Christoph Feb 13 '14 at 12:17

[...] remove_all removes recursively. Also the code only fails occasionally - I can delete my result directories 15k times without any problems, and then it suddenly fails. That's why I concluded that there's something still being written into the directory by mv. – Christoph Feb 13 '14 at 13:17

[...] done after mv exits, it is not the source of the problem. Either a different process sometimes writes in the same directory (could you have id collisions?), or remove_all can fail to find and remove all files before removing the directory. What's left in the directory when you encounter a failure? – alexis Feb 13 '14 at 13:39

[...] touch done that works over nfs, or any other way of signalling to another process on another machine that it can harvest a result. – Christoph Feb 13 '14 at 13:58

[...] remove_all in a loop now until the error disappears - a bit brutal but it should have the desired effect without causing any harm. If I leave data in the directory, that could quickly add up to a few TB. Not good. – Christoph Feb 13 '14 at 14:40

[...] touch is not atomic?" I don't buy this scenario. On Unix, holding an open file descriptor does not lock the directory entry-- you can unlink away. The inode and blocks won't be freed until the descriptor is closed, but that doesn't get in the way of unlinking the directory. – alexis Feb 14 '14 at 00:05
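The retry workaround mentioned in the comments (calling remove_all in a loop until the error disappears) might look roughly like the sketch below. It is an illustration only: it assumes C++17 std::filesystem rather than boost::filesystem, remove_all_with_retry is a hypothetical name, and the attempt limit and delay are arbitrary values chosen so that the loop cannot spin forever, which differs from the unbounded loop described above.

```cpp
#include <chrono>
#include <filesystem>
#include <system_error>
#include <thread>

namespace fs = std::filesystem;

// Hypothetical helper: call remove_all repeatedly until the directory is
// gone or max_attempts is reached. Returns true once dir no longer exists.
bool remove_all_with_retry(const fs::path& dir,
                           int max_attempts = 10,
                           std::chrono::milliseconds delay = std::chrono::milliseconds(200))
{
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        std::error_code ec;
        fs::remove_all(dir, ec);  // non-throwing overload
        if (!ec && !fs::exists(dir, ec))
            return true;

        // Wait before the next attempt so late-arriving files can show up.
        if (attempt + 1 < max_attempts)
            std::this_thread::sleep_for(delay);
    }
    return false;
}
```

A caller would replace remove_all(resultPath) with remove_all_with_retry(resultPath) and log or escalate when it returns false. This hides the symptom rather than explaining it, so it is best combined with some logging of what was left in the directory.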