Why can't I overwrite a file with processed value?

Question

This question prompted a thought I was not sure I understood well. I know that it is not possible or correct to use pipelines like cat myfile | grep -v mypattern > myfile due to how file handles are set up. However, why can't we simply use cat myfile | grep -v mypattern| tee myfile>/dev/null to modify the file in place? Are there any simple examples where it fails?

Specifically, does it lead to corruption, or is the issue only that the file gets overwritten rather than edited in place?

Updating the question: I would appreciate it if the answers also considered this:

Are there problems with using cat myfile | grep -v mypattern | bash -c 'rm myfile; cat > myfile'?

Note that most of the usual text-processing tools that have an in-place edit option (e.g., sed's or perl's -i option) don't actually edit the files in place. Instead, they write to a temporary file and mv it over the original when finished, or do something very similar. Significantly, the temp file (and thus the "in-place"-edited file) will have a new inode. So -i etc. are merely convenience features to save you the minor bother of doing it yourself. BTW, if you really want to keep the same inode, you could script your edits with ed. – cas, May 01 '16 at 02:35
Reading from a file at the same time that you are overwriting it will never be reliable. Due to the vagaries of buffers and such, it may work occasionally, but it will never be something to depend on. – John1024, May 01 '16 at 02:57
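
For illustration, here is a minimal sketch of the write-to-a-temp-file-then-mv approach that cas describes, done by hand. The scratch directory, the sample lines, and the temp-file name are made up; myfile and mypattern are the names from the question. It is not meant as what sed or perl do internally, only as the same idea spelled out.

```shell
# Hypothetical scratch directory and sample data, for illustration only.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf 'keep 1\nmypattern\nkeep 2\n' > myfile

# Inode number of the original file.
before=$(ls -i myfile | awk '{print $1}')

# Write the filtered output to a temp file in the same directory,
# and only replace the original once the filter has succeeded.
tmp=$(mktemp ./myfile.XXXXXX)
grep -v mypattern myfile > "$tmp" && mv "$tmp" myfile

# Inode number after the replacement: it is a different file now.
after=$(ls -i myfile | awk '{print $1}')
echo "inode before: $before, after: $after"
cat myfile
```

Since the original is never opened for writing, the reader and the writer cannot interfere. The price is the new inode, and the result takes the temp file's permissions (mktemp creates it readable and writable by the owner only), so those may need restoring.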

Thomas Dickey · Answer 1 · 2016-05-01T00:59:35.843

1

You can't simply do that because the tee command overwrites the file, making it shorter (probably) and eliminating the cat command's ability to read the data that was in the file.

If you could ensure that programs such as tee opened a new file, and if the shell guaranteed that cat opened its copy first, then you could copy from the old (actually deleted) file to the new. But there are a lot of ifs and few guarantees.

You might suppose, for instance, that cat would start first, and tee later (when it is needed to capture data). But the shell starts both, and unless tee is waiting for input before cat starts, the writes could fail (since no one is waiting, and those bytes have nowhere to go). It is easier to make processes wait on a read than on a write.

edited May 01 '16 at 00:59

answered May 01 '16 at 00:37

Thomas Dickey

76,765

Ok, would `cat myfile | grep -v mypattern | bash -c 'rm myfile; cat > myfile'` cause trouble? – Rahul Gopinath May 01 '16 at 00:44
In the case of bash -c 'rm...', am I not assured of having a file handle to the file before deletion, because the shell will set up the file handles before running the program? – Rahul Gopinath May 01 '16 at 01:05
@rahul No, you are not. Mostly it will probably work, but if the initialization of cat is slow (e.g., the command is not in cache and has to be read from disk), then you risk that the bash command starts before the reading cat gets going. Think: (sleep 1; cat myfile) | grep -v mypattern | bash -c 'rm myfile; cat > myfile', which clearly fails. – Ole Tange May 01 '16 at 18:55
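
The delayed-reader pipeline from the last comment can be tried safely on a throwaway file. This is only a sketch: the scratch directory and the sample lines are made up, and the one-second sleep stands in for a slow start of the reading cat.

```shell
# Hypothetical scratch directory and sample data, for illustration only.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf 'keep 1\nmypattern\nkeep 2\n' > myfile

# The reader is held back for a second, so the writer side removes
# myfile and recreates it (empty) long before cat opens it.
(sleep 1; cat myfile) | grep -v mypattern | bash -c 'rm myfile; cat > myfile'

# cat opened the new, empty myfile, so nothing was copied.
wc -c < myfile
```

With the delay in place, the original data is lost: the reading cat never sees the old contents, because the name myfile already points at the new empty file when it opens it.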

Ole Tange · Accepted Answer · 2016-05-01T13:57:20.770

1

The problem is that you cannot guarantee which is executed first. So you have to delay unlinking and writing to the file until you are absolutely sure that the file is opened for reading.

This will buffer the file in RAM before writing it.

cat foo | perl -e '$f=shift; undef $/; @out=<>; open WRT,">",$f; print WRT @out' foo

Advantage: Keeps permissions of foo. If interrupted you have not lost the original foo.

Disadvantage: foo must fit in RAM.

This will open the file for reading, remove it, and cat from the still-open descriptor. In parallel, the right-hand side waits for the name foo to disappear and, once it is gone, cats into a new foo.

(rm foo; cat) < foo | (perl -e 'while(-e "foo"){}'; cat >foo)

Advantage: Short. Works on files bigger than RAM.

Disadvantage: foo is gone as soon as you start.

(mv foo bar; cat) < foo | (perl -e 'while(-e "foo"){}'; cat >foo && rm bar)

Advantage: Works on files bigger than RAM. If it fails, foo is kept as a backup in bar.
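
The requirement that foo is open for reading before it is unlinked or rewritten can also be met without a pipeline, by letting a single shell do the three steps strictly in sequence. This is a sketch under assumptions, not one of the commands above: the scratch directory and sample lines are made up, and grep -v mypattern stands in for whatever filter is wanted.

```shell
# Hypothetical scratch directory and sample data, for illustration only.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf 'keep 1\nmypattern\nkeep 2\n' > foo

# 1. "< foo" opens the old file on stdin before anything in the braces runs.
# 2. rm unlinks the name; the data stays reachable through the open descriptor.
# 3. "> foo" creates a brand-new file, which grep fills from the old data.
{ rm foo; grep -v mypattern > foo; } < foo

cat foo
```

Afterwards foo should hold only the two keep lines. As with the rm variant above, the original foo is gone as soon as the command starts, so an interruption loses data; replacing rm foo with mv foo bar keeps a backup in the same way.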

edited May 01 '16 at 13:57

answered May 01 '16 at 07:27

Ole Tange

35,514

"The problem is that you cannot guarantee which is executed first." I understand that no shell guarantees it, but doesn't every shell set up the file handles first before running any command? – Rahul Gopinath May 01 '16 at 16:08
Also, (rm foo; cat) < foo | (perl -e 'while(-e "foo"){}'; cat >foo) can we be sure that the file handle for cat >foo will not be opened before rm foo? – Rahul Gopinath May 01 '16 at 16:40
Yes, because of the ;. The command is not spawned before perl finishes. – Ole Tange May 01 '16 at 18:45
"but doesn't every shell set up the file handles first before running any command" This makes it worse: before the reading cat starts reading, the writing cat will have truncated the file. Reading from and writing to the same actual file (not just the same file name) takes special care and is usually only done by database servers. – Ole Tange May 01 '16 at 18:48
Ok, I think I understand. – Rahul Gopinath May 01 '16 at 20:29
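
The point that setting up the file handles first makes things worse can be seen in isolation. In this small sketch (scratch directory and file contents are made up), the command true never reads or writes anything, yet the redirection alone empties the file before the command runs.

```shell
# Hypothetical scratch directory and sample data, for illustration only.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf 'data\n' > myfile
echo "size before:" $(wc -c < myfile)

# The shell opens myfile for writing, truncating it, and only then runs true.
true > myfile

echo "size after:" $(wc -c < myfile)
```

So in cat myfile | grep -v mypattern > myfile, the truncation happens as part of starting grep, independently of whether cat has read anything yet.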
