10

A related question is here.

I often have to edit a large file by removing a few lines from the middle of it. I know which lines I wish to remove and I typically do the following:

sed "linenum1,linenum2 d" input.txt > input.temp

or in place by adding the -i option. Since I know the line numbers, is there a command to avoid stream-editing and just remove the particular lines? input.txt can be as large as 50 GB.

11 Answers

13

What you could do to avoid writing a copy of the file is to write the file over itself like:

{
  sed "$l1,$l2 d" < file
  perl -le 'truncate STDOUT, tell STDOUT'
} 1<> file

Dangerous as you've no backup copy there.

Or avoiding sed, stealing part of manatwork's idea:

{
  head -n "$(($l1 - 1))"
  head -n "$(($l2 - $l1 + 1))" > /dev/null
  cat
  perl -le 'truncate STDOUT, tell STDOUT'
} < file 1<> file

That could still be improved, because you're overwriting the first l1 - 1 lines over themselves when you don't need to. Avoiding that would mean somewhat more involved programming, for instance doing everything in perl, which may end up less efficient:

perl -ne 'BEGIN{($l1,$l2) = ($ENV{"l1"}, $ENV{"l2"})}
    if ($. == $l1) {$s = tell(STDIN) - length; next}
    if ($. == $l2) {seek STDOUT, $s, 0; $/ = \32768; next}
    if ($. > $l2) {print}
    END {truncate STDOUT, tell STDOUT}' < file 1<> file

Some timings for removing lines 1000000 to 1000050 from the output of seq 1e7:

  • sed -i "$l1,$l2 d" file: 16.2s
  • 1st solution: 1.25s
  • 2nd solution: 0.057s
  • 3rd solution: 0.48s

They all work on the same principle: we open two file descriptors to the file, one in read-only mode (0) using < file (short for 0< file) and one in read-write mode (1) using 1<> file (<> file would be 0<> file). Those file descriptors point to two open file descriptions that will each have a current cursor position within the file associated with them.
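To see what that 1<> redirection does on its own, here is a tiny illustrative demo (the file name is made up): opening a file read-write does not truncate it, so a write simply overwrites bytes at the current offset:

printf 'abcdefgh\n' > demo.txt
printf 'XY' 1<> demo.txt      # overwrites only the first two bytes, no truncation
cat demo.txt                  # prints: XYcdefgh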

In the second solution for instance, the first head -n "$(($l1 - 1))" will read $l1 - 1 lines worth of data from fd 0 and write that data to fd 1. So at the end of that command, the cursor on both open file descriptions associated with fds 0 and 1 will be at the start of the $l1th line.

Then, in head -n "$(($l2 - $l1 + 1))" > /dev/null, head will read $l2 - $l1 + 1 lines from the same open file description through its fd 0 which is still associated with it, so the cursor on fd 0 will move to the beginning of the line after the $l2th one.

But its fd 1 has been redirected to /dev/null, so upon writing to fd 1, it will not move the cursor in the open file description pointed to by {...}'s fd 1.

So, upon starting cat, the cursor on the open file description pointed to by fd 0 will be at the start of the next line after $l2, while the cursor on fd 1 will still be at the beginning of the $l1th line. Or said otherwise, that second head will have skipped those lines to remove on input but not on output. Now cat will overwrite the $l1th line with the next line after $l2 and so on.

cat will return when it reaches the end of the file on fd 0. But fd 1 will point to somewhere in the file that has not been overwritten yet. That part has to go away; it corresponds to the space occupied by the deleted lines, now shifted to the end of the file. What we need is to truncate the file at the exact location where that fd 1 points to now.

That's done with the ftruncate system call. Unfortunately, there's no standard Unix utility to do that, so we resort to perl. tell STDOUT gives us the current cursor position associated with fd 1. And we truncate the file at that offset using perl's interface to the ftruncate system call: truncate.

In the third solution, we replace the writing to fd 1 of the first head command with one lseek system call.
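As a concrete way to try the second solution on disposable data (a sketch only; the file names and the l1/l2 values from the timings above are just illustrative), you can keep a sed-produced reference copy and compare against it:

l1=1000000 l2=1000050
seq 1e7 > file
sed "$l1,$l2 d" file > expected        # reference result, written to a separate file
{
  head -n "$(($l1 - 1))"
  head -n "$(($l2 - $l1 + 1))" > /dev/null
  cat
  perl -le 'truncate STDOUT, tell STDOUT'
} < file 1<> file
cmp file expected && echo "in-place edit matches sed output"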

  • I don't get nearly as dramatic an improvement as you do. I get about 2 times better performance with your second solution. Good enough for the answer. – LasEspuelas Mar 04 '13 at 02:06
  • Off topic, can you explain solution 2 briefly? Or point to where I can learn? Is file fed to each command and each command operates from where the previous one ended? – LasEspuelas Mar 04 '13 at 02:09
  • @sturgman, the reason why I'm getting such improvements is probably because all the data is in cache (memory) for me. I've added some details to my answer. – Stéphane Chazelas Mar 04 '13 at 18:14
  • Nice! I seem to be able to replace perl with GNU truncate, for example { head -n "$(($l1 - 1))"; c=$(head -n "$(($l2 - $l1 + 1))" | wc -m); cat; truncate -s -$c file; } <file 1<>file. I can't see a way to avoid passing the actual file name to truncate though. Do you see any issues with this? Thanks – iruvar Sep 22 '14 at 12:59
7

Using sed is a good approach: It is clear, it streams the file (no problem with long files), and can easily be generalized to do more. But if you want a simple way to edit the file in-place, the easiest thing is to use ed or ex:

(echo 10,31d; echo wq) | ed input.txt

A better approach, guaranteed to work with files of unlimited size (and for lines as long as your RAM allows) is the following perl one-liner which edits the file in place:

perl -n -i -e 'print if $. < 10 || $. > 31' input.txt

Explanation:

-n: Apply the script to each line. Produce no other output.
-i: Edit the file in-place (use -i.bck to make a backup).
-e ...: Print each line, except lines 10 to 31.
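For example, a cautious trial run on throwaway data might look like this (the sample file and the .bck backup suffix are just illustrative):

seq 100 > sample.txt
perl -n -i.bck -e 'print if $. < 10 || $. > 31' sample.txt
wc -l sample.txt sample.txt.bck        # 78 lines left, 100 in the backup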

alexis
  • 5,759
  • Nice trick! It's also possible to count lines from the end, for instance (echo '$-4,$d'; echo wq ) | ed input.txt deletes the last 5 lines. See also the line addressing section of the manual. – Nemo Mar 10 '15 at 06:48
  • 2
    However, I think ex (elvis 2.2.0) is better for large files: CPU-bound, few KB of memory used. ed (1.6) ran out of memory (over 2 GB memory used) on my 9 GB file. Which explains why you said to use sed or perl for files of unlimited size. ;-) – Nemo Mar 10 '15 at 12:20
  • Good point, ex is a more robust line-oriented editor, and it accepts a superset of ed commands (in particular, it understands 10,31d and wq as given above). – alexis Oct 10 '19 at 09:34
  • I was optimistic to see this answer but ran into the same problem as https://unix.stackexchange.com/questions/66730/is-there-a-faster-way-to-remove-a-line-given-a-line-number-from-a-file?noredirect=1&lq=1#comment1211116_156694 – Ryan Apr 22 '21 at 20:37
  • Actually, since my file is only 1.4 G, I then tried your first approach, and that worked. +1. Thanks! – Ryan Apr 22 '21 at 20:48
1

You can use Vim in Ex mode:

ex -sc '1d2|x' input.txt
  1. 1 move to the first line

  2. d delete

  3. 2 two lines (the count for d)

  4. x save and close
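Adapted to the question's example of deleting lines 10 through 31, the same idea can also take an ed-style range (a small illustration only):

ex -sc '10,31d|x' input.txt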

Zombo
  • 1
1

In the special case where the content of the lines to be deleted is unique in the file, another option is to use grep -v with the content of the line rather than the line numbers. For instance, if only one unique line should be deleted (the deletion of a single line was, for instance, asked in this duplicate thread), or several lines which all have the same unique content.

Here is an example

grep -v "content of lines to delete" input.txt > input.tmp

I have run some benchmarks with a file containing 340,000 lines. The grep approach seems to be around 15 times faster than the sed method in this case.

Here are the commands and the timings:

time sed -i "/CDGA_00004.pdbqt.gz.tar/d" /tmp/input.txt

real    0m0.711s
user    0m0.179s
sys     0m0.530s

time perl -ni -e 'print unless /CDGA_00004.pdbqt.gz.tar/' /tmp/input.txt

real    0m0.105s
user    0m0.088s
sys     0m0.016s

time (grep -v CDGA_00004.pdbqt.gz.tar /tmp/input.txt > /tmp/input.tmp; mv /tmp/input.tmp /tmp/input.txt )

real    0m0.046s
user    0m0.014s
sys     0m0.019s

I have tried both with and without setting LC_ALL=C; it does not change the timings. The search string (CDGA_00004.pdbqt.gz.tar) is somewhere in the middle of the file.

Jadzia
  • 151
  • 2
    while this is good information, the Question here does know the line numbers, and so I don't think this appropriately answers the question – Jeff Schaller Mar 19 '17 at 12:56
  • Yes, it is a special case of the asked question, as mentioned. The other thread I mentioned, which was marked as duplicate, only wants to delete a single line, and if the line is unique (and known) then my suggested solution is applicable. – Jadzia Mar 19 '17 at 13:04
1

If you need to read and write 50 GiB, that will take a long time, regardless of what you do. And unless the lines are of fixed length, or you have some other way to know where the lines to be deleted are, there is no way around reading the file up to the last line to be deleted. Maybe a custom program that just counts newlines and later copies full blocks is a bit faster than sed(1), but I believe that is not your bottleneck. Try using time(1) to find out how the time is apportioned.
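For instance (just an illustration using the command from the question): if most of the elapsed time is neither user nor sys time, the process is mostly waiting on the disk rather than burning CPU:

# real >> user + sys suggests I/O-bound; user close to real suggests CPU-bound
time sed "linenum1,linenum2 d" input.txt > input.temp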

vonbrand
  • 18,253
  • Afaik, you need to read (and write) everything after the last line to be deleted, since the offset of what follows the deletion will change. E.g., to delete the first line of a 50GB file, you'll need to read and write out everything that remains. (At least, it's possible to do it in place if you're careful...) – alexis Oct 10 '19 at 09:37
1

Would this help?

perl -e '
           $num1 = 5;
           $num2 = 10000;
           open IN,  "<", "input_file.txt";
           open OUT, ">", "output_file.txt";
           print OUT scalar <IN> for (1 .. $num1-1);  # copy the lines before the range
           <IN> for ($num1 .. $num2);                 # read and discard the unwanted lines
           undef $/;                                  # slurp the rest of the file in one go
           print OUT <IN>;
           close IN;
           close OUT;
          '

This removes any lines between 5 and 10000 inclusive. Change the numbers to fit your needs. Can't see an efficient way of doing it in situ, though (i.e. this approach will have to print to a different output file).
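A quick sanity check with throwaway data (numbers purely illustrative, using the $num1 = 5 and $num2 = 10000 values above):

seq 20000 > input_file.txt
# run the perl script above, then:
wc -l output_file.txt          # expect 20000 - (10000 - 5 + 1) = 10004 lines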

Joseph R.
  • 39,549
1

If you want to edit the file in place, most shell tools won't help you because when you open a file for writing, you only have a choice of truncating it (>) or appending to it (>>), not overwriting existing contents. dd is a notable exception. See Is there a way to modify a file in-place?

export LC_ALL=C
lines_to_keep=$((linenum1 - 1))
lines_to_skip=$((linenum2 - linenum1 + 1))
deleted_bytes=$({ { head -n "$lines_to_keep"        # pass through the lines before the range
                    head -n "$lines_to_skip" >&3;   # divert the lines to delete to fd 3
                    cat                             # pass through the rest of the file
                  } <big_file | dd of=big_file conv=notrunc;   # overwrite big_file in place, no truncation
                } 3>&1 | wc -c)                     # count the bytes of the deleted lines
dd if=/dev/null of=big_file bs=1 seek="$(($(wc -c <big_file) - $deleted_bytes))"   # truncate to the new size

(Warning: untested!)
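Given the warning, it may be worth a dry run on disposable data first; something like this (entirely illustrative) keeps a sed-made reference copy to compare against:

linenum1=10 linenum2=31
seq 1000 > big_file
sed "$linenum1,$linenum2 d" big_file > expected
# ...now run the block above on big_file...
cmp big_file expected && echo "matches sed's output"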

1

Using Raku (formerly known as Perl_6)

Code to remove lines 1000000 to 1000050 from the output of seq 1e7:

FASTEST EXAMPLE:

~$ seq 1e7 > seq_1e7.txt
~$ raku -e 'lines[0..999999].join("\n").put; lines.skip(50).join("\n").put'  < seq_1e7.txt 1<> seq_1e7_modified.txt

#OR (similar speed)

~$ seq 1e7 > seq_1e7.txt
~$ raku -e 'lines.head(1000000).join("\n").put; lines.tail(*-1000050).join("\n").put;' < seq_1e7.txt 1<> seq_1e7_modified.txt

Timing (MacOS, M2 Max) typical run:

6.20s user 1.41s system 101% cpu 7.525 total

Raku doesn't have Perl's (or sed's) -i "in-place" command line flag, so the code above uses the <> Bourne/POSIX shell redirection operator as described by @StéphaneChazelas. Make backups always, and write to a tmp file whenever possible! See @StéphaneChazelas' excellent post for a precise explanation of this < file 1<> file code.


SLOWER EXAMPLES:

I expected the first codeblock below to be as fast as above, but it's about 3.2X slower (total time vs. total time):

~$ raku -e '.put for lines[0..999999]; .put for lines.skip(50)'  < seq_1e7.txt 1<> seq_1e7_modified.txt

The next examples are easy-to-read, but run 3.2X-4.6X slower than the code at top (total time vs. total time):

~$ raku -e 'lines[0..999999,1000049..*].join("\n").put;' <  seq_1e7.txt  1<> seq_1e7_modified.txt

#OR

~$ raku -ne '.put unless 1000000 < ++$ < 1000050;' < seq_1e7.txt 1<> seq_1e7_modified.txt

For the code example immediately above (2nd in block), you can see how this Raku code is a mash-up of @alexis' excellent Perl answer (translated to Raku), with file operations again taken directly from @StéphaneChazelas' excellent post. Note how in Raku there are two fundamental changes from Perl:

  1. In Raku the $. variable is gone, replaced by the ++$ anonymous state variable (used here to count line numbers).
  2. Raku can do "chained" inequalities such as 1000000 < ++$ < 1000050, obviating the need for the || short-circuiting operator as seen in the Perl code.

Finally, if you need to only remove a single line, here's the Raku translation of @abligh's Perl code. It runs about as fast as the codeblock immediately above:

~$ raku -ne '.put  unless  ++$ == 1000000;' <  seq_1e7.txt  1<> seq_1e7_modified.txt

https://en.wikipedia.org/wiki/Inequality_(mathematics)#Chained_notation
https://unix.stackexchange.com/a/66746/227738
https://raku.org

jubilatious1
  • 3,195
0

This is nice and simple:

perl -i -n -e 'print unless $.==13' /path/to/your/file

to remove e.g. line 13 from /path/to/your/file

abligh
  • 397
  • like GNU sed (GNU sed borrowed -i from perl), that writes the output into a second file (by the same name) so is not going to be significantly faster. – Stéphane Chazelas Nov 24 '14 at 14:25
  • This looked appealing, but when I ran it on my Macbook Pro, I got error: Can't open perl script "print unless $.==13": No such file or directory even though I could prove that the file exists by running ls -lah /path/to/your/file. – Ryan Apr 22 '21 at 20:35
  • Depending on your perl version, you might want perl -i -n -e 'print unless $.==13' /path/to/your/file - I will amend – abligh Apr 22 '21 at 22:35
-1

Note that this is a reply to a different question that was marked a duplicate.

The question was how to remove line 4125889 from in.csv.

You can either do things unsafely - then you may be fast but may lose the whole file - or you can depend on the speed of the editor you are using.

I recommend:

echo '\0013\0003y' | VED_FTMPFIR=. ved +4125878 in.csv

where you need 3x the file size and end with in.csv and in.csv.bak

or:

echo '\0013\0003!' | VED_FTMPFIR=. ved +4125878 in.csv

where you need 2x the file size and the resulting file will be written in place.

Note that you need a POSIX compliant shell (echo) implementation to get the escapes properly expanded. The editor ved is part of the schily tools and available at:

http://sourceforge.net/projects/schilytools/files/

in schily-*.tar.bz2

It uses the fastest swap file mechanism I am aware of.

The VED_FTMPFIR=. environment variable sets the directory for the swap file to the current directory. Select any directory that has sufficient space.

schily
  • 19,173
  • The behaviour of echo '\0013\0003!' is unspecified by POSIX. Posixly, you'd write printf '\13\3!\n'. – Stéphane Chazelas Oct 02 '15 at 14:36
  • Do you know of a single system that intentionally decided not to be XSI compliant? – schily Oct 02 '15 at 14:48
  • most Linux/GNU-based Unix-like software distributions only seek (without committing to) POSIX conformance, and only follow XSI unless that would break backward compatibility (as would be the case for echo). Same for FreeBSD. Between themselves, that probably constitutes over 90% of the audience of this Q&A site, so they can't be ignored. Best is to give up altogether on echo. That command is beyond hope of being made portable. – Stéphane Chazelas Oct 02 '15 at 15:16
  • I am not aware of a single Linux distro that seeks POSIX compliance. The Linux folks decided that they neither actively collaborate in the POSIX process nor try to follow existing POSIX standards. They were given the chance to get a POSIX certification for 1 $ and Andrew Josey (OpenGroup chair) spent a lot of time helping the Linux people with the certification and related fixes, but at some point they stopped any related activity. BTW: The situation with echo is a result of the implementation in bash, which is a result of the unwillingness of the FSF to follow existing standards. – schily Oct 02 '15 at 16:47
  • There's little benefit for a Linux+other free software distribution vendor to get certified, and it would be hard to achieve anyway because the software is developed by 3rd parties. But there's a lot of benefit in being POSIX conformant (or at least in agreeing on one standard, and the opensource community has so far failed to come up with a compelling alternative to POSIX) to ease interoperability, and you see most core software maintainers aiming at that. For echo (a lost cause), blaming bash is wrong since bash is conformant when in the right environment (and is certified via OS/X). – Stéphane Chazelas Oct 02 '15 at 18:29
-1

You could add a q (quit) instruction to your sed command when linenum2 is reached, so sed stops processing the rest of the file:

sed 'linenum1,linenum2d;linenum2q' file
watael
  • 911