If you time the steps you take correctly, this can be pretty easy. The most important thing is to get a buffer of your source file that is not going to implode if overworked. The only real way to do that is with another file - which the shell makes very easy to do.
```shell
{   head -n "$((num_lines_before_insert))"
    grep key temp_file; sed \$d
} <<SOURCE_FILE >desired.txt
$( cat <source_file; echo .)
SOURCE_FILE
```
So, for most shells (including `bash` and `zsh`, but not `dash` or `yash`), when you use a `<<` here-document the shell creates a uniquely named temp file in `${TMPDIR:-/tmp}`, `exec`s it on the input file descriptor you specify (or, by default, just 0), and promptly deletes it. By the time it is served as input to your command it is an unnamed file - it has no remaining links to any filesystem and is just waiting for the kernel to clean it up before it disappears completely. It is a proper file, though - its data exists somewhere on disk (or, at least, within the VFS, in the likely case of a tmpfs) and the kernel will ensure it continues to exist at least until you release the file descriptor.
In that way - for as long as your shell gets an actual backing file for the heredoc - heredocs represent a very secure and simple means of handling temporary-file needs, because they are fully written, and all filesystem names are already removed from them, before you ever read them. So their data cannot be tampered with while you work.
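You can peek at what actually backs a heredoc yourself. The sketch below is Linux-specific (it relies on `/proc/self/fd`), and the `readlink` output is informational only: on file-backed heredocs (e.g. zsh, or older bash) it shows a deleted temp file, while on pipe-backed ones (e.g. dash, or newer bash for small heredocs) it shows `pipe:[...]`.

```shell
#!/bin/sh
# Inspect what backs a here-document on fd 0 (Linux-only readlink;
# the result varies by shell and shell version, as noted above).
out=$({
    readlink /proc/self/fd/0 2>/dev/null || true
    cat                     # the heredoc's contents arrive on fd 0
} <<EOF
hello from the heredoc
EOF
)
printf '%s\n' "$out"
```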
The above block first writes the temp file with `cat`, preserving any and all trailing blank lines from the command substitution with `echo` - which adds a single line to the tail of the file. From the `{` compound command `}` the output of its three commands is written to `desired.txt` - two of which, `head` and `sed`, read in their turn the head and tail of the source file from the here-document, while the `grep` command between them inserts your `key` match.
I'm not certain if you needed this - but I thought it was relevant to show that you can simply and safely fully overwrite a source file with a sequence like this.
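The pattern can be sketched with throwaway files - the names `src.txt`, `keys.txt`, and `out.txt` here are illustrative only. Note the caveat above: on shells that back heredocs with pipes rather than files (dash, or newer bash for small heredocs), `head` cannot seek back over what it buffered, so the `sed` portion may come up short there.

```shell
#!/bin/sh
# Hypothetical demo of the heredoc insert pattern described above.
printf '%s\n' one two three four five >src.txt
printf '%s\n' 'key alpha' 'other beta' 'key gamma' >keys.txt

num_lines_before_insert=2
{   head -n "$num_lines_before_insert"   # lines of src.txt before the insert
    grep key keys.txt                    # the inserted matching lines
    sed \$d                              # the rest, minus the "." echo added
} <<SOURCE_FILE >out.txt
$( cat <src.txt; echo .)
SOURCE_FILE
cat out.txt
```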
If your shell doesn't get an actual file for heredocs, you can emulate what it does like...
```shell
{   set "$$" "${TMPDIR:-/tmp}" "$@"
    exec <"$2/$( set -C
                 >"$2/$1" cat &&
                 echo "$1")" >&1
    rm -- "$2/$1"; shift 2
    head "-n$((before))"
    grep ... keyfile; cat
} <source_file 1<>source_file
```
...which will ensure all files are writable and safely assigned to file descriptors before taking any irreversible action, and which also does all filesystem cleanup before doing the same.
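The core of the block above is the noclobber trick, which can be sketched in isolation - under `set -C` the `>` redirection refuses to open a file that already exists, so creating the temp file and learning its name is a single race-free step. The path `/tmp/demo.$$` here is only an illustration.

```shell
#!/bin/sh
# Minimal sketch of race-free temp file creation with noclobber.
tmp=$(
    set -C                           # noclobber: > fails on existing files
    printf data >"/tmp/demo.$$" &&   # create-or-fail, never overwrite
    echo "/tmp/demo.$$"              # report the name only on success
)
out=$(cat "$tmp")
echo "$out"
rm -- "$tmp"
```

If the target already existed, `echo` never runs, `$tmp` expands empty, and the failure surfaces immediately rather than clobbering another process's file.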
Here is a test I ran to demonstrate this:
```shell
cd /tmp
set "$$" "${TMPDIR:-/tmp}" "$@"
seq 5000000 >test
printf 'line %s\n' 1 2 3 4 5 >test2
{   exec <"$2/$( set -C
                 >"$2/$1" cat &&
                 echo "$1")" >&1
    rm -- "$2/$1"; shift 2
    head -n2500000
    grep 3 test2; cat
} <test 1<>test
```
This first created two files - one called `/tmp/test`, which was just 5 million numbered lines as written by `seq`, and a second called `/tmp/test2`, which was just 5 lines like...

```
line 1
line 2
line 3
line 4
line 5
```
I next ran the above block, then I did...
```shell
sed -n '1p;$p;2499999,2500002l' <test
wc -l test
```
...which, interestingly, took practically the same amount of time to perform as the insert operation, and printed:
```
1
2499999$
2500000$
line 3$
2500001$
5000000
5000001 test
```
So here's how this works:
- The `1<>` redirection is important - it sets the O_RDWR flag on stdout and ensures that as each process writes into the file it writes over the file's previous contents. In other words, at no point is the source/destination file ever truncated; rather, it is rewritten head to tail.
- The command substitution for `exec` gets the racy part done as soon as is possible (or as soon as I know it can be). Within the command sub noclobber is set, so if `${TMPDIR:-/tmp}/$$` already exists the expansion results in `exec <"${TMPDIR:-/tmp}/"`, which in an interactive shell will cease the whole process right away, or, in a script, will cause the script to exit with a meaningful error, as the shell cannot `exec` a directory as stdin.
- Within the command sub `cat` copies `source_file` to a temp file that doesn't already exist and `echo` writes the name to stdout.
- As soon as all file handles are `exec`ed, `rm` `unlink()`s the new temp file, so its only fleeting claim to existence now is the `<` redirect it was just assigned.
- `head` seeks through 2.5mil lines and writes over `source_file`'s first 2.5mil lines. The point is to seek through both files to equal offsets.
- With that in mind, this portion could be more I/O-efficient - when the newly created tmp file is on a tmpfs and the source file is on a disk - were the I/O reversed here so that `head` read from the on-disk file and wrote to the file in RAM.
- If you wanted to do that, though, you'd need to do `exec <>"$(...` and `head ... <&1 >&0` to make the tmp file read/writable, and maybe use `head`/`tail` with a specified number of lines for the tail end. In that case the number need not even be exact - you can loop over input in similar fashion, advancing the offset only a little at a time. The shell's builtin `read` can be used to test for EOF - or `wc` can be used at loop open.
- This is because `cat` will probably just hang on a `<>` stdin, because it will never see EOF.
- `grep` reads some data from some other file and writes it into `source_file`, overwriting only as many bytes as it read from elsewhere.
- `cat` corrects whatever discrepancy `grep` may just have caused by writing what remains of its stdin out to its stdout, `1<>source_file`.
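The no-truncation property of `1<>` can be sketched on a throwaway file (the name `f` is illustrative):

```shell
#!/bin/sh
# Sketch of the 1<> behavior: a read/write descriptor starts at
# offset 0 and overwrites bytes in place; it never truncates as > does.
printf '%s\n' aaaa bbbb cccc >f   # 15 bytes on disk
printf XXXX 1<>f                  # overwrite only the first 4 bytes
out=$(cat f)                      # the rest of the file is intact
printf '%s\n' "$out"
rm -- f
```

Had plain `>f` been used instead, the file would have been truncated to just the 4 bytes written.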