27

The command below may take minutes, depending on the file size. Is there a more efficient method?

sed -i 1d large_file 
Cheng
  • 6,641

5 Answers

38

Try ed instead:

ed <<< $'1d\nwq' large_file
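
The $'…' here-string is specific to bash and a few similar shells; the same commands can also be piped in (a minor variation that splits wq into separate w and q commands):

printf '%s\n' 1d w q | ed -s large_file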

If “large” means about 10 million lines or more, better use tail. It cannot edit in place, but its performance makes that shortcoming forgivable:

tail -n +2 large_file > large_file.new

Edit to show some time differences:

(awk code by Jaypal added to have execution times on the same machine (2.2 GHz CPU).)

bash-4.2$ seq 1000000 > bigfile.txt # further file creations skipped

bash-4.2$ time sed -i 1d bigfile.txt
real    0m4.318s

bash-4.2$ time ed -s <<< $'1d\nwq' bigfile.txt
real    0m0.533s

bash-4.2$ time perl -pi -e 'undef$_ if$.==1' bigfile.txt
real    0m0.626s

bash-4.2$ time { tail -n +2 bigfile.txt > bigfile.new && mv -f bigfile.new bigfile.txt; }
real    0m0.034s

bash-4.2$ time { awk 'NR>1 {print}' bigfile.txt > newfile.txt && mv -f newfile.txt bigfile.txt; }
real    0m0.328s
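
If the tail-and-rename pattern suits your case, it can be wrapped in a small shell function. This is only a sketch (the name drop_first_line is made up); note that the rename creates a new file, so hard links to the original and descriptors already open on it are not updated:

drop_first_line() {
    # write everything except line 1 to a temporary file in the same
    # directory, then rename it over the original
    tmp=$(mktemp "$(dirname -- "$1")/tmp.XXXXXX") || return
    tail -n +2 -- "$1" > "$tmp" && mv -f -- "$tmp" "$1"
}

drop_first_line bigfile.txt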
manatwork
  • 31,277
  • In case of tail, I would rather count the time to do both remove the first line and replace bigfile.txt with bigfile.new. – rozcietrzewiacz Nov 29 '11 at 14:30
  • @rozcietrzewiacz, your point is correct. Thank you. Updated. – manatwork Nov 29 '11 at 14:57
  • This is really cool! I did the same with awk and got the following result: `[jaypal:~/Temp] seq 1000000 > bigfile.txt`, then `[jaypal:~/Temp] time awk 'NR>1 {print}' bigfile.txt > newfile.txt` gave `real 0m0.649s  user 0m0.601s  sys 0m0.033s`. – jaypal singh Nov 29 '11 at 20:29
  • 1
    @Jaypal, I added your code to the list of alternatives. On my machine it was even faster. Strange, I expected awk's performance to be closer to sed's. (Note to myself: never expect – test instead.) – manatwork Nov 30 '11 at 07:49
  • This was the best solution in my case: tail -n +2 bigfile.txt > bigfile.new && mv -f bigfile.new bigfile.txt; I am using a single file with a lock to keep track of a single task list used by multiple processes. I started with what the initial poster used: sed -i 1d large_file. That was causing the file to lock for 1-2 seconds. The tail/mv combo completes almost instantaneously. Thank you! – Chris Adams Apr 11 '17 at 12:50
7

There is no way to efficiently remove things from the start of a file. Removing data from the beginning requires re-writing the whole file.

Truncating from the end of a file can be very quick though (the OS only has to adjust the file size information, possibly clearing up now-unused blocks). This is not generally possible when you try to remove from the head of a file.
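
For instance (assuming GNU coreutils here), cutting bytes off the end is a single metadata update, while cutting the same amount off the front means rewriting everything that follows:

truncate -s -100 large_file                          # drop the last 100 bytes in place, near-instant
tail -c +101 large_file > tmp && mv tmp large_file   # drop the first 100 bytes: full rewrite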

It could theoretically be "fast" if you removed a whole block/extent exactly, but there are no system calls for that, so you'd have to rely on filesystem-specific semantics (if such exist). (Or having some form of offset inside the first block/extent to mark the real start of file, I guess. Never heard of that either.)
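
As a comment below notes, Linux 3.15 later added exactly such an operation (FALLOC_FL_COLLAPSE_RANGE) for some extent-based filesystems. A hedged sketch using util-linux's fallocate(1), assuming ext4 or XFS and an offset/length that are multiples of the filesystem block size; it removes whole blocks, not lines:

fallocate --collapse-range --offset 0 --length 4096 large_file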

Mat
  • 52,586
  • If the file is very large, I/O overhead is likely to be (possibly much) greater than the CPU overhead required to process the line endings. – Mat Nov 29 '11 at 10:46
  • You are right. However, there could be differences in the way the tools access the file content. It is best not to process line by line when not necessary, or at least not to read line by line when not necessary. – manatwork Nov 29 '11 at 10:56
  • 2
    I'm surprised the difference is so big in your results, and can reproduce it with that file size here. Benefits seem to decrease as the file size increases though (tried with seq 10M, 15s for sed, 5s for ed). Good tips anyway (+1). – Mat Nov 29 '11 at 11:13
  • Starting with version 3.15, Linux now has an API to collapse parts of a file on some extent based file systems, but at least for ext4 that can only be done on full blocks (usually 4k). – Stéphane Chazelas Nov 24 '14 at 13:20
  • Even if editing requires re-writing the entire file, it's sometimes very handy to have command-line tools to efficiently edit. In my case, this helped when I had to remove the first line of a file that was larger than my total system RAM. – Jason Jun 01 '17 at 05:10
3

The most efficient method: don't do it! If you do, you need twice the 'large' file's space on disk in any case, and you waste I/O.

If you are stuck with a large file that you want to read without the first line, wait until you actually need to read it before removing that line. If you need to feed the file to a program on its stdin, use tail to do it:

tail -n +2 large_file | your_program

When you need to read the file, you can take the opportunity to remove the first line at the same time, but only if you have the needed space on disk:

tail -n +2 large_file | tee large_file2 | your_program

If your program can't read from stdin, use a fifo:

mkfifo large_file_wo_1st_line
tail -n +2 large_file > large_file_wo_1st_line&
your_program -i large_file_wo_1st_line

or even better, if you are using bash, take advantage of process substitution:

your_program -i <(tail -n +2 large_file)

If you need to seek in the file, I do not see a better solution than not getting stuck with the file in the first place. If the file is produced by a program writing to stdout:

large_file_generator | tail -n +2 > large_file

Otherwise, there is always the fifo or the process substitution solution:

mkfifo large_file_with_1st_line
large_file_generator -o large_file_with_1st_line&
tail -n +2 large_file_with_1st_line > large_file_wo_1st_line

large_file_generator -o >(tail -n +2 > large_file_wo_1st_line)
Jeff Schaller
  • 67,283
jfg956
  • 6,336
1

You can use Vim in Ex mode:

ex -sc '1d|x' large_file
  1. 1 selects the first line

  2. d deletes it

  3. x saves and closes
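
Equivalently, the same commands can be fed to ex on its standard input (just a sketch; it assumes an ex that reads a command script from a pipe, which both Vim's ex mode and nvi do):

printf '1d\nx\n' | ex large_file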

Zombo
  • 1
0

This is just theorizing, but...

A custom filesystem (implemented using FUSE or a similar mechanism) could expose a directory whose contents are exactly the same as those of an already existing directory somewhere else, but with files truncated as you wish. The filesystem would translate all the file offsets, so you wouldn't have to do a time-consuming rewrite of the file.

But given that this idea is very non-trivial, unless you have tens of terabytes of such files, implementing such a filesystem would be too expensive and time-consuming to be practical.

liori
  • 630