10

A related question is here.

I often have to edit a large file by removing a few lines from the middle of it. I know which lines I wish to remove and I typically do the following:

sed "linenum1,linenum2 d" input.txt > input.temp

or in place by adding the -i option. Since I know the line numbers, is there a command to avoid stream-editing and just remove the particular lines? input.txt can be as large as 50 GB.

11 Answers

13

What you could do to avoid writing a copy of the file is to write the file over itself like:

{
  sed "$l1,$l2 d" < file
  perl -le 'truncate STDOUT, tell STDOUT'
} 1<> file

Dangerous as you've no backup copy there.

Or avoiding sed, stealing part of manatwork's idea:

{
  head -n "$(($l1 - 1))"
  head -n "$(($l2 - $l1 + 1))" > /dev/null
  cat
  perl -le 'truncate STDOUT, tell STDOUT'
} < file 1<> file

That could still be improved, because you're overwriting the first l1 - 1 lines over themselves when you don't need to. Avoiding that would mean somewhat more involved programming, for instance doing everything in perl, which may end up less efficient:

perl -ne 'BEGIN{($l1,$l2) = ($ENV{"l1"}, $ENV{"l2"})}
    if ($. == $l1) {$s = tell(STDIN) - length; next}
    if ($. == $l2) {seek STDOUT, $s, 0; $/ = \32768; next}
    if ($. > $l2) {print}
    END {truncate STDOUT, tell STDOUT}' < file 1<> file

Some timings for removing lines 1000000 to 1000050 from the output of seq 1e7:

  • sed -i "$l1,$l2 d" file: 16.2s
  • 1st solution: 1.25s
  • 2nd solution: 0.057s
  • 3rd solution: 0.48s

They all work on the same principle: we open two file descriptors to the file, one in read-only mode (0) using < file (short for 0< file) and one in read-write mode (1) using 1<> file (<> file would be 0<> file). Those file descriptors point to two open file descriptions that will each have a current cursor position within the file associated with them.
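To see what that 1<> redirection does on its own, here is a tiny illustrative demo (the file name is made up): opening a file read-write does not truncate it, so a write simply overwrites bytes at the current offset:

printf 'abcdefgh\n' > demo.txt
printf 'XY' 1<> demo.txt      # overwrites only the first two bytes, no truncation
cat demo.txt                  # prints: XYcdefgh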

In the second solution for instance, the first head -n "$(($l1 - 1))" will read $l1 - 1 lines worth of data from fd 0 and write that data to fd 1. So at the end of that command, the cursor on both open file descriptions associated with fds 0 and 1 will be at the start of the $l1th line.

Then, in head -n "$(($l2 - $l1 + 1))" > /dev/null, head will read $l2 - $l1 + 1 lines from the same open file description through its fd 0 which is still associated with it, so the cursor on fd 0 will move to the beginning of the line after the $l2th one.

But its fd 1 has been redirected to /dev/null, so upon writing to fd 1, it will not move the cursor in the open file description pointed to by {...}'s fd 1.

So, upon starting cat, the cursor on the open file description pointed to by fd 0 will be at the start of the next line after $l2, while the cursor on fd 1 will still be at the beginning of the $l1th line. Or said otherwise, that second head will have skipped those lines to remove on input but not on output. Now cat will overwrite the $l1th line with the next line after $l2 and so on.

cat will return when it reaches the end of the file on fd 0. But fd 1 will point to somewhere in the file that has not been overwritten yet. That part has to go away; it corresponds to the space occupied by the deleted lines, now shifted to the end of the file. What we need is to truncate the file at the exact location where that fd 1 points to now.

That's done with the ftruncate system call. Unfortunately, there's no standard Unix utility to do that, so we resort to perl. tell STDOUT gives us the current cursor position associated with fd 1. And we truncate the file at that offset using perl's interface to the ftruncate system call: truncate.

In the third solution, we replace the writing to fd 1 of the first head command with one lseek system call.
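As a concrete way to try the second solution on disposable data (a sketch only; the file names and the l1/l2 values from the timings above are just illustrative), you can keep a sed-produced reference copy and compare against it:

l1=1000000 l2=1000050
seq 1e7 > file
sed "$l1,$l2 d" file > expected        # reference result, written to a separate file
{
  head -n "$(($l1 - 1))"
  head -n "$(($l2 - $l1 + 1))" > /dev/null
  cat
  perl -le 'truncate STDOUT, tell STDOUT'
} < file 1<> file
cmp file expected && echo "in-place edit matches sed output"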

  • I don't get nearly as dramatic an improvement as you do. I get about 2 times better performance with your second solution. Good enough for the answer. – LasEspuelas Mar 04 '13 at 02:06
  • Off topic, can you explain solution 2 briefly? Or point to where I can learn? Is file fed to each command and each command operates from where the previous one ended? – LasEspuelas Mar 04 '13 at 02:09
  • @sturgman, the reason why I'm getting such improvements is probably because all the data is in cache (memory) for me. I've added some details to my answer. – Stéphane Chazelas Mar 04 '13 at 18:14
  • Nice! I seem to be able to replace perl with GNU truncate, for example { head -n "$(($l1 - 1))"; c=$(head -n "$(($l2 - $l1 + 1))" | wc -m); cat; truncate -s -$c file; } <file 1<>file. I can't see a way to avoid passing the actual file name to truncate though. Do you see any issues with this? Thanks – iruvar Sep 22 '14 at 12:59
7

Using sed is a good approach: It is clear, it streams the file (no problem with long files), and can easily be generalized to do more. But if you want a simple way to edit the file in-place, the easiest thing is to use ed or ex:

(echo 10,31d; echo wq) | ed input.txt

A better approach, guaranteed to work with files of unlimited size (and for lines as long as your RAM allows) is the following perl one-liner which edits the file in place:

perl -n -i -e 'print if $. < 10 || $. > 31' input.txt

Explanation:

-n: Apply the script to each line. Produce no other output.
-i: Edit the file in-place (use -i.bck to make a backup).
-e ...: Print each line, except lines 10 to 31.
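For example, a cautious trial run on throwaway data might look like this (the sample file and the .bck backup suffix are just illustrative):

seq 100 > sample.txt
perl -n -i.bck -e 'print if $. < 10 || $. > 31' sample.txt
wc -l sample.txt sample.txt.bck        # 78 lines left, 100 in the backup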

alexis
  • 5,759
  • Nice trick! It's also possible to count lines from the end, for instance (echo '$-4,$d'; echo wq ) | ed input.txt deletes the last 5 lines. See also the line addressing section of the manual. – Nemo Mar 10 '15 at 06:48
  • 2
    However, I think ex (elvis 2.2.0) is better for large files: CPU-bound, few KB of memory used. ed (1.6) ran out of memory (over 2 GB memory used) on my 9 GB file. Which explains why you said to use sed or perl for files of unlimited size. ;-) – Nemo Mar 10 '15 at 12:20
  • Good point, ex is a more robust line-oriented editor, and it accepts a superset of ed commands (in particular, it understands 10,31d and wq as given above). – alexis Oct 10 '19 at 09:34
  • I was optimistic to see this answer but ran into the same problem as https://unix.stackexchange.com/questions/66730/is-there-a-faster-way-to-remove-a-line-given-a-line-number-from-a-file?noredirect=1&lq=1#comment1211116_156694 – Ryan Apr 22 '21 at 20:37
  • Actually, since my file is only 1.4 G, I then tried your first approach, and that worked. +1. Thanks! – Ryan Apr 22 '21 at 20:48
1

You can use Vim in Ex mode:

ex -sc '1d2|x' input.txt
  1. 1 move to the first line

  2. d delete

  3. 2 two lines (the count for d)

  4. x save and close
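Adapted to the question's example of deleting lines 10 through 31, the same idea can also take an ed-style range (a small illustration only):

ex -sc '10,31d|x' input.txt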

Zombo
  • 1
1

In the special case where the content of the lines to be deleted is unique in the file, another option is to use grep -v with the content of the line rather than the line numbers. For instance, if only one unique line should be deleted (the deletion of a single line was, for instance, asked in this duplicate thread), or several lines which all have the same unique content.

Here is an example

grep -v "content of lines to delete" input.txt > input.tmp

I have run some benchmarks with a file containing 340,000 lines. The grep approach seems to be around 15 times faster than the sed method in this case.

Here are the commands and the timings:

time sed -i "/CDGA_00004.pdbqt.gz.tar/d" /tmp/input.txt

real    0m0.711s
user    0m0.179s
sys     0m0.530s

time perl -ni -e 'print unless /CDGA_00004.pdbqt.gz.tar/' /tmp/input.txt

real    0m0.105s
user    0m0.088s
sys     0m0.016s

time (grep -v CDGA_00004.pdbqt.gz.tar /tmp/input.txt > /tmp/input.tmp; mv /tmp/input.tmp /tmp/input.txt )

real    0m0.046s
user    0m0.014s
sys     0m0.019s

I have tried both with and without setting LC_ALL=C; it does not change the timings. The search string (CDGA_00004.pdbqt.gz.tar) is somewhere in the middle of the file.

Jadzia
  • 151
  • 2
    while this is good information, the Question here does know the line numbers, and so I don't think this appropriately answers the question – Jeff Schaller Mar 19 '17 at 12:56
  • Yes, it is a special case of the asked question, as mentioned. The other thread I mentioned, which was marked as duplicate, only wants to delete a single line, and if the line is unique (and known) then my suggested solution is applicable. – Jadzia Mar 19 '17 at 13:04
1

If you need to read and write 50 GiB, that will take a long time, regardless of what you do. And unless the lines are of fixed length, or you have some other way to know where the lines to be deleted are, there is no way around reading the file up to the last line to be deleted. Maybe a custom program that just counts newlines and later copies full blocks is a bit faster than sed(1), but I believe that is not your bottleneck. Try using time(1) to find out how the time is apportioned.
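For instance (just an illustration using the command from the question): if most of the elapsed time is neither user nor sys time, the process is mostly waiting on the disk rather than burning CPU:

# real >> user + sys suggests I/O-bound; user close to real suggests CPU-bound
time sed "linenum1,linenum2 d" input.txt > input.temp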

vonbrand
  • 18,253
  • Afaik, you need to read (and write) everything after the last line to be deleted, since the offset of what follows the deletion will change. E.g., to delete the first line of a 50GB file, you'll need to read and write out everything that remains. (At least, it's possible to do it in place if you're careful...) – alexis Oct 10 '19 at 09:37
1

Would this help?

perl -e '
           $num1 = 5;
           $num2 = 10000;
           open IN,  "<", "input_file.txt";
           open OUT, ">", "output_file.txt";
           print OUT scalar <IN> for (1 .. $num1-1);  # copy the lines before the range
           <IN> for ($num1 .. $num2);                 # read and discard the unwanted lines
           undef $/;                                  # slurp the rest of the file in one go
           print OUT <IN>;
           close IN;
           close OUT;
          '

This removes any lines between 5 and 10000 inclusive. Change the numbers to fit your needs. Can't see an efficient way of doing it in situ, though (i.e. this approach will have to print to a different output file).
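A quick sanity check with throwaway data (numbers purely illustrative, using the $num1 = 5 and $num2 = 10000 values above):

seq 20000 > input_file.txt
# run the perl script above, then:
wc -l output_file.txt          # expect 20000 - (10000 - 5 + 1) = 10004 lines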

Joseph R.
  • 39,549
1

If you want to edit the file in place, most shell tools won't help you because when you open a file for writing, you only have a choice of truncating it (>) or appending to it (>>), not overwriting existing contents. dd is a notable exception. See Is there a way to modify a file in-place?

export LC_ALL=C
lines_to_keep=$((linenum1 - 1))
lines_to_skip=$((linenum2 - linenum1 + 1))
deleted_bytes=$({ { head -n "$lines_to_keep"        # pass through the lines before the range
                    head -n "$lines_to_skip" >&3;   # divert the lines to delete to fd 3
                    cat                             # pass through the rest of the file
                  } <big_file | dd of=big_file conv=notrunc;   # overwrite big_file in place, no truncation
                } 3>&1 | wc -c)                     # count the bytes of the deleted lines
dd if=/dev/null of=big_file bs=1 seek="$(($(wc -c <big_file) - $deleted_bytes))"   # truncate to the new size

(Warning: untested!)
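Given the warning, it may be worth a dry run on disposable data first; something like this (entirely illustrative) keeps a sed-made reference copy to compare against:

linenum1=10 linenum2=31
seq 1000 > big_file
sed "$linenum1,$linenum2 d" big_file > expected
# ...now run the block above on big_file...
cmp big_file expected && echo "matches sed's output"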

1

Using Raku (formerly known as Perl_6)

Code to remove lines 1000000 to 1000050 from the output of seq 1e7:

FASTEST EXAMPLE:

~$ seq 1e7 > seq_1e7.txt
~$ raku -e 'lines[0..999999].join("\n").put; lines.skip(50).join("\n").put'  < seq_1e7.txt 1<> seq_1e7_modified.txt

#OR (similar speed)

~$ seq 1e7 > seq_1e7.txt
~$ raku -e 'lines.head(1000000).join("\n").put; lines.tail(*-1000050).join("\n").put;' < seq_1e7.txt 1<> seq_1e7_modified.txt

Timing (MacOS, M2 Max) typical run:

6.20s user 1.41s system 101% cpu 7.525 total

Raku doesn't have Perl's (or sed's) -i "in-place" command line flag, so the code above uses the <> Bourne/POSIX shell redirection operator as described by @StéphaneChazelas. Make backups always, and write to a tmp file whenever possible! See @StéphaneChazelas' excellent post for a precise explanation of this < file 1<> file code.


SLOWER EXAMPLES:

I expected the first codeblock below to be as fast as above, but it's about 3.2X slower (total time vs. total time):

~$ raku -e '.put for lines[0..999999]; .put for lines.skip(50)'  < seq_1e7.txt 1<> seq_1e7_modified.txt

The next examples are easy-to-read, but run 3.2X-4.6X slower than the code at top (total time vs. total time):

~$ raku -e 'lines[0..999999,1000049..*].join("\n").put;' <  seq_1e7.txt  1<> seq_1e7_modified.txt

#OR

~$ raku -ne '.put unless 1000000 < ++$ < 1000050;' < seq_1e7.txt 1<> seq_1e7_modified.txt

For the code example immediately above (2nd in block), you can see how this Raku code is a mash-up of @alexis' excellent Perl answer (translated to Raku), with file operations again taken directly from @StéphaneChazelas' excellent post. Note how in Raku there are two fundamental changes from Perl:

  1. In Raku the $. variable is gone, replaced by the ++$ anonymous state variable (used here to count line numbers).
  2. Raku can do "chained" inequalities such as 1000000 < ++$ < 1000050, obviating the need for the || short-circuiting operator as seen in the Perl code.

Finally, if you need to only remove a single line, here's the Raku translation of @abligh's Perl code. It runs about as fast as the codeblock immediately above:

~$ raku -ne '.put  unless  ++$ == 1000000;' <  seq_1e7.txt  1<> seq_1e7_modified.txt

https://en.wikipedia.org/wiki/Inequality_(mathematics)#Chained_notation
https://unix.stackexchange.com/a/66746/227738
https://raku.org

jubilatious1
  • 3,195
0

This is nice and simple:

perl -i -n -e 'print unless $.==13' /path/to/your/file

to remove e.g. line 13 from /path/to/your/file

abligh
  • 397
  • like GNU sed (GNU sed borrowed -i from perl), that writes the output into a second file (by the same name) so is not going to be significantly faster. – Stéphane Chazelas Nov 24 '14 at 14:25
  • This looked appealing, but when I ran it on my Macbook Pro, I got error: Can't open perl script "print unless $.==13": No such file or directory even though I could prove that the file exists by running ls -lah /path/to/your/file. – Ryan Apr 22 '21 at 20:35
  • Depending on your perl version, you might want perl -i -n -e 'print unless $.==13' /path/to/your/file - I will amend – abligh Apr 22 '21 at 22:35
-1

Note that this is a reply to a different question that was marked a duplicate.

The question was how to remove line 4125889 from in.csv.

You can either do things unsafely - then you may be fast but may lose the whole file - or you can depend on the speed of the editor you are using.

I recommend:

echo '\0013\0003y' | VED_FTMPFIR=. ved +4125878 in.csv

where you need 3x the file size and end with in.csv and in.csv.bak

or:

echo '\0013\0003!' | VED_FTMPFIR=. ved +4125878 in.csv

where you need 2x the file size and the resulting file will be written in place.

Note that you need a POSIX compliant shell (echo) implementation to get the escapes properly expanded. The editor ved is part of the schily tools and available at:

http://sourceforge.net/projects/schilytools/files/

in schily-*.tar.bz2

It uses the fastest swap file mechanism I am aware of.

The VED_FTMPFIR=. environment variable sets the directory for the swap file to the current directory. Select any directory that has sufficient space.

schily
  • 19,173
  • The behaviour of echo '\0013\0003!' is unspecified by POSIX. Posixly, you'd write printf '\13\3!\n'. – Stéphane Chazelas Oct 02 '15 at 14:36
  • Do you know of a single system that intentionally decided not to be XSI compliant? – schily Oct 02 '15 at 14:48
  • most Linux/GNU-based Unix-like software distributions only seek (without committing to) POSIX conformance, and only follow XSI unless that would break backward compatibility (as would be the case for echo). Same for FreeBSD. Between themselves, that probably constitutes over 90% of the audience of this Q&A site, so they can't be ignored. Best is to give up altogether on echo. That command is beyond hope of being made portable. – Stéphane Chazelas Oct 02 '15 at 15:16
  • I am not aware of a single Linux distro that seeks POSIX compliance. The Linux folks decided that they neither actively collaborate in the POSIX process nor try to follow existing POSIX standards. They were given the chance to get a POSIX certification for 1 $ and Andrew Josey (OpenGroup chair) spent a lot of time helping the Linux people with the certification and related fixes, but at some point they stopped any related activity. BTW: The situation with echo is a result of the implementation in bash, which is a result of the unwillingness of the FSF to follow existing standards. – schily Oct 02 '15 at 16:47
  • There's little benefit for a Linux+other free software distribution vendor to get certified, and it would be hard to achieve anyway because the software is developed by 3rd parties. But there's a lot of benefit in being POSIX conformant (or at least in agreeing on one standard, and the opensource community has so far failed to come up with a compelling alternative to POSIX) to ease interoperability, and you see most core software maintainers aiming at that. For echo (a lost cause), blaming bash is wrong since bash is conformant when in the right environment (and is certified via OS/X). – Stéphane Chazelas Oct 02 '15 at 18:29
-1

You could add a q (quit) instruction to your sed command when linenum2 is reached, so sed stops processing the rest of the file:

sed 'linenum1,linenum2d;linenum2q' file
watael
  • 911