Efficiently remove first couple of lines from a text file

Question

head/tail will need to iterate almost the whole file (depending on the position of the line you give as parameter). Then you copy that result to a new file and delete the old one.
I am not sure if sed will be iterating the whole file but you need to copy that result to a new file and delete the old one. Even with -i (in place) it creates a temporary file under the hood, so the same thing applies.

Why not just move the pointer that points to the first line of the file and move it to the line that we want?

How could we do such a thing? Do I have to do it in C? Is there another way?

Does that make sense? Am I thinking about this wrong? If so, why?

Yeah, try doing it in C. Submit a patch to coreutils once you get it working. /s Does that make sense ? - Nope. You're confusing files with data structures. — Satō Katsura, May 17 '17 at 16:23
I see your sarcasm, and I know there is probably a wrong way of thinking from my part, but where am I thinking wrong ? — HashWizard, May 17 '17 at 16:24
@SatoKatsura When I say pointer, I don't necessarily mean the pointer you use in your programming language. Pointer as in 'the whatever thing that knows where the file starts and ends' are the pointer(s) I am talking about — HashWizard, May 17 '17 at 16:28
When you talk about pointers it sounds like you have a more complicated structure than a simple text file. Because there's no pointer (or reference, or handle, or anything like it) to the start of a text file. It just starts. Are you talking about the file system itself? — , May 17 '17 at 16:30
sed makes a copy, only if you do inplace with backup (not even with). — ctrl-alt-delor, May 17 '17 at 16:32
@DrEval Yes I am talking about the file system. EOF (End of File) and BOF (Beginning of File) are the pointers. — HashWizard, May 17 '17 at 16:33
sed makes a copy, only if you do inplace with backup (not even with).
head only reads first few lines and then quits. tail could read backwards (that is what I would do). — ctrl-alt-delor, May 17 '17 at 16:34
@richard but when you want to delete starting lines (or last lines) head would have to do extra leg work, essentially creating a new file where the lines you want deleted do not exist. The point is that I am talking about head in the context of deleting something from a potentially large file — HashWizard, May 17 '17 at 16:35
Good answers to this question on stack overflow here: http://stackoverflow.com/questions/604864/print-a-file-skipping-x-lines-in-bash/12929067#12929067 — David Parks, May 17 '17 at 16:36
I think I have a bug in a plugin, and can't delete my 1st two comments. Can you help? — ctrl-alt-delor, May 17 '17 at 16:37
Related (almost duplicate): Is there a faster way to remove a line (given a line number) from a file? — Stéphane Chazelas, May 18 '17 at 09:58

Gilles 'SO- stop being evil' · Answer 1 · 2019-11-30T21:35:15.040

Why not just move the pointer that points to the first line of the file and move it to the line that we want?

Because there's no such thing as a “pointer that points to the first line of the file”.

The basic operations to modify a file are: overwrite a range of bytes (i.e. replace a portion with data of the same length), append (i.e. add at the end), truncate (i.e. remove from the end).

Most filesystems store files in fixed-sized blocks, except that the last block may be partial. There's no way to modify the data in place if the modification would change the size of what is modified, unless the change is at the end or the modification would shift data by a whole number of blocks. Shifting data by a whole number of blocks would only work by coincidence, and there's no widespread interface¹ to do that.

The most efficient way to remove data at the beginning of a file is to copy the data that needs to be kept to a new file, which is precisely what tail -n +42 or sed -n '42,$p' do.
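
A minimal sketch of that copy-and-replace approach. The scratch directory and the file names demo.txt and demo.txt.new are made up for the example, and seq and mktemp are assumed to be available:

```shell
#!/bin/sh
# Sketch: drop the first 41 lines by copying the rest to a new file.
# All paths below are hypothetical and live in a scratch directory.
dir=$(mktemp -d)
seq 1 100 > "$dir/demo.txt"          # sample file: 100 numbered lines

# Copy everything from line 42 onward, then replace the original.
tail -n +42 "$dir/demo.txt" > "$dir/demo.txt.new" &&
    mv "$dir/demo.txt.new" "$dir/demo.txt"

first_line=$(head -n 1 "$dir/demo.txt")
line_count=$(wc -l < "$dir/demo.txt")
echo "first line: $first_line, lines left: $((line_count))"
```

Note that mv swaps in a new file under the old name, so hard links to the original and, depending on the setup, its ownership or permissions may not carry over.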

¹ Modern Linux systems have a system call to remove a portion of a file: fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, …), which you can call via the utility fallocate --collapse-range …. There's also FALLOC_FL_INSERT_RANGE and --insert-range. But they are limited to blocks, which makes them mostly useless for text files, and they aren't available with all filesystems.

@StéphaneChazelas From man fallocate: (-c option) Available since Linux 3.15 for ext4 (only for extent-based files) and XFS. (not for many filesystems yet). — , Nov 30 '19 at 21:18

G-Man Says 'Reinstate Monica' · Answer 2 · 2017-05-18T04:37:08.367

Gilles beat me to it: there is no “pointer that points to the first line of the file”. The first line of the file — the beginning of the file — is always the first character of the file. (There may be obscure, individual applications that recognize such a notion, but there’s nothing like this at the system level.)

What you already know:

Commands like

sed '1,6d' filename
sed -n '7,$p' filename
tail -n +7 filename

(and probably other variants) will write all but the first 6 lines of filename to the standard output. (They all, of course, read all of the file.) While we’re at it,

sed -n '1,6p' filename
sed '7,$d' filename
head -n 6 filename
sed '6q' filename

will write the first 6 lines of filename to the standard output. The first two might or might not read the entire file; the last two probably will not.
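
As a quick sanity check, the following sketch runs all seven commands on a throwaway 20-line file and compares their outputs. The scratch directory and the file names in it are assumptions made for the example:

```shell
#!/bin/sh
# Sketch: confirm the "skip 6" and "keep 6" variants agree on a sample file.
dir=$(mktemp -d)                    # scratch directory (example only)
seq 1 20 > "$dir/sample"            # 20 numbered lines

# Three ways to print everything after line 6.
sed '1,6d'    "$dir/sample" > "$dir/skip1"
sed -n '7,$p' "$dir/sample" > "$dir/skip2"
tail -n +7    "$dir/sample" > "$dir/skip3"

# Four ways to print only the first 6 lines.
sed -n '1,6p' "$dir/sample" > "$dir/keep1"
sed '7,$d'    "$dir/sample" > "$dir/keep2"
head -n 6     "$dir/sample" > "$dir/keep3"
sed '6q'      "$dir/sample" > "$dir/keep4"

skip_same=yes
for f in skip2 skip3; do
    cmp -s "$dir/skip1" "$dir/$f" || skip_same=no
done
keep_same=yes
for f in keep2 keep3 keep4; do
    cmp -s "$dir/keep1" "$dir/$f" || keep_same=no
done
echo "skip variants identical: $skip_same"
echo "keep variants identical: $keep_same"
```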

Also,

command input_filename > the_same_filename

doesn’t work, as discussed in Warning regarding “>”.
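
A small sketch of why it fails, using a hypothetical three-line file in a scratch directory: the shell truncates the output file before the command ever starts reading it.

```shell
#!/bin/sh
# Sketch: redirecting a command's output onto its own input file.
dir=$(mktemp -d)                         # scratch directory (example only)
printf 'one\ntwo\nthree\n' > "$dir/demo"

# The shell opens demo for writing and truncates it to zero length
# *before* sed runs, so sed finds an empty input file.
sed '1d' "$dir/demo" > "$dir/demo"

size=$(wc -c < "$dir/demo")
echo "size after redirecting onto itself: $((size)) bytes"
```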

What you might not know:

command arguments    1<> filename

will open filename for reading and writing without truncating (clobbering) it. So,

sed '1,6d' filename  1<> the_same_filename

might be the first step in the solution you are looking for. This is probably as close as you’re going to come to removing the first M lines of a file “in place”; it will read the file and overwrite it concurrently, without creating another file. If M is small enough (or, specifically, if the number of bytes in the first M lines is small enough), this may read each block of the file once and write each block once — and you can’t do any better than that.

Just the first step?

I created this test file:

$ cat -n foo
     1  a
     2  bcd
     3  efghi
     4  jklmnop
     5  qrstuvwxy
     6  z0123456789
     7  ABCDEFGHIJKLM
     8  Once upon a midnight dreary, while I pondered, weak and weary,
     9  Over many a quaint and curious volume of forgotten lore—
    10  While I nodded, nearly napping, suddenly there came a tapping,
    11  As of some one gently rapping—rapping at my chamber door.
    12  "'Tis some visitor," I muttered, "tapping at my chamber door—
    13                                    Only this and nothing more."
    14  The quick brown
    15  fox jump over the
    16  lazy dog. Once upon
    17  this midnight dreary,

This file is painstakingly constructed so that the lengths of the lines (including newlines) are 2, 4, 6, 8, 10, 12, 14, 63, 57, 63, 58, 62, 63, 16, 18, 20, and 22. Note that the first six lines therefore contain 2+4+6+8+10+12=42 bytes. The last two lines contain 20+22 bytes, which is coincidentally (!) also 42. (The total file size is 504 bytes. The lengths listed above count characters and sum to 498; the three em-dashes take 3 bytes each in UTF-8, which accounts for the other 6.) So,

$ ls -l foo
-rw-r--r-- 1 myusername mygroupname 504 May 18 04:25 foo

$ sed '1,6d' foo 1<> foo

$ ls -l foo
-rw-r--r-- 1 myusername mygroupname 504 May 18 04:32 foo

$ cat -n foo
     1  ABCDEFGHIJKLM
     2  Once upon a midnight dreary, while I pondered, weak and weary,
     3  Over many a quaint and curious volume of forgotten lore—
     4  While I nodded, nearly napping, suddenly there came a tapping,
     5  As of some one gently rapping—rapping at my chamber door.
     6  "'Tis some visitor," I muttered, "tapping at my chamber door—
     7                                    Only this and nothing more."
     8  The quick brown
     9  fox jump over the
    10  lazy dog. Once upon
    11  this midnight dreary,
    12  lazy dog. Once upon
    13  this midnight dreary,

OK, good, the first six lines are gone. The original line number 7 (“ABCDEFGHIJKLM”) is now line number 1. But, what’s this? The file has gone from 17 lines to 13. It should be 11 (17−6). And the last two lines (“lazy dog … midnight dreary”) are there twice.

This is one of the pitfalls of the 1<> operator — if you don’t truncate the output file, you can’t end up with a file that’s smaller than the one you started with. Specifically, here, the output from sed '1,6d' foo is 462 bytes (504−42, since the first six lines contain 42 bytes), and so it overwrites the first 462 bytes of the output file — which is also foo. And the first 462 bytes of foo are all but the last 42 (504−462) — so the last two lines do not get overwritten. The two copies of the last two lines (“lazy dog … midnight dreary”) are one that’s the output from sed, followed by one that’s left over from the original contents of the file.
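
The same effect can be reproduced on a much smaller file. This is only a sketch; the 10-byte file named small is an assumption made for the example:

```shell
#!/bin/sh
# Sketch: 1<> overwrites from the start but never shrinks the file.
dir=$(mktemp -d)                             # scratch directory (example only)
printf 'a\nb\nc\nd\ne\n' > "$dir/small"      # 5 lines, 10 bytes

# sed emits "c", "d", "e" (6 bytes), which overwrite bytes 0-5.
# Bytes 6-9 (the old "d" and "e" lines) are left as they were.
sed '1,2d' "$dir/small" 1<> "$dir/small"

size=$(wc -c < "$dir/small")
content=$(tr '\n' ' ' < "$dir/small")        # join lines for display
echo "size: $((size)) bytes"
echo "lines: $content"
```

Here 10−4=6 bytes are rewritten and the remaining 4 bytes are stale, so the last two lines show up twice, just as with foo.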

So, what next?

All we need to do now is to throw away the last 42 bytes of the file. As it happens, this can be done by just moving the pointer that points to the end of the file. OK, it’s not actually a pointer; it’s an integer file size — potAto, potAHto. For the past 20 or 30 years, Unix has allowed you to truncate a file to a desired size, leaving the data up to that point untouched, and discarding the data beyond that point.

An ancient command that will do this is

dd if=/dev/null bs=462 seek=1 of=foo 2> /dev/null

which copies /dev/null over foo, starting at byte 462. Yes, it’s somewhat of a kluge. A newer command that does this function is

truncate -s 462 foo

This might not be present on all systems; it is not specified by POSIX.
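
A sketch comparing the two on copies of the same file. The file names a and b are made up, and the script falls back to dd if truncate is not installed:

```shell
#!/bin/sh
# Sketch: cut a file down to its first 18 bytes with dd and with truncate.
dir=$(mktemp -d)                     # scratch directory (example only)
seq 1 40 > "$dir/a"                  # lines 1-9 take 2 bytes each = 18 bytes
cp "$dir/a" "$dir/b"

# dd: seek one 18-byte block into the output, copy nothing from
# /dev/null, and let dd truncate the output file at that offset.
dd if=/dev/null bs=18 seek=1 of="$dir/a" 2> /dev/null

# truncate: set the size directly (not POSIX, so check for it first).
if command -v truncate > /dev/null 2>&1; then
    truncate -s 18 "$dir/b"
else
    dd if=/dev/null bs=18 seek=1 of="$dir/b" 2> /dev/null
fi

size_a=$(wc -c < "$dir/a")
size_b=$(wc -c < "$dir/b")
echo "dd: $((size_a)) bytes, truncate: $((size_b)) bytes"
```

Both files should end up holding just the lines 1 through 9.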

So, putting it all together,

#!/bin/sh
filename="$1"
bytes_to_remove=$(sed '6q' "$filename" | wc -c)
total_size=$(stat -c '%s' "$filename")
sed '1,6d' "$filename" 1<> "$filename"
new_size=$((total_size - bytes_to_remove))
truncate -s "$new_size" "$filename"

We use wc -c to count the bytes in the first six lines (produced by sed '6q'), subtract that from the total file size, and truncate the file to that size. You can use any of the alternative commands to output the first M lines or the last N−M lines, and you can replace the last line with

dd if=/dev/null bs="$new_size" seek=1 of="$filename" 2> /dev/null
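
The script hardcodes 6 lines. A possible generalization, as a sketch only: remove_first_lines is a hypothetical helper name, wc -c < file is swapped in for stat -c '%s' because stat's options vary between systems, and truncate is assumed to be installed.

```shell
#!/bin/sh
# Sketch: remove the first M lines of FILE in place (hypothetical helper).
remove_first_lines() {
    file=$1
    m=$2
    # With m=0, sed '1,0d' would still delete line 1, so bail out early.
    [ "$m" -ge 1 ] || return 0

    bytes_to_remove=$(head -n "$m" "$file" | wc -c)
    total_size=$(wc -c < "$file")
    new_size=$((total_size - bytes_to_remove))

    # Overwrite from the start without truncating, then cut the stale tail.
    sed "1,${m}d" "$file" 1<> "$file" &&
        truncate -s "$new_size" "$file"
}

# Usage on a throwaway 17-line file in a scratch directory.
dir=$(mktemp -d)
seq 1 17 > "$dir/foo"
remove_first_lines "$dir/foo" 6
remaining=$(wc -l < "$dir/foo")
echo "lines remaining: $((remaining))"
```

Nothing here protects against the file being modified by another process in the middle of the rewrite, and an interruption between the sed and truncate steps leaves the stale tail in place.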

Caveats:

I haven’t tested this on files with

CR-LF line endings, or
multibyte characters,

and these might be problematic.
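
A partial check of the byte-counting step only, not of the whole procedure: wc -c counts bytes rather than characters, so a carriage return or a multibyte character simply adds to the count. The inputs are made-up samples, written with printf escapes so the script itself stays ASCII.

```shell
#!/bin/sh
# Sketch: what wc -c reports for the first line of two awkward inputs.

# First line is "a" + CR + LF: 3 bytes.
crlf_bytes=$(printf 'a\r\nb\r\n' | sed '1q' | wc -c)

# \303\251 is the 2-byte UTF-8 encoding of e-acute; with LF that is 3 bytes.
utf8_bytes=$(printf '\303\251\nx\n' | sed '1q' | wc -c)

echo "CR-LF first line: $((crlf_bytes)) bytes"
echo "UTF-8 first line: $((utf8_bytes)) bytes"
```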

See also Is there a faster way to remove a line (given a line number) from a file? for possible ways to improve the 1<>-based approach. — Stéphane Chazelas, May 18 '17 at 09:57

DopeGhoti · Answer 3 · May 17 '17 at 16:31

Looking at the source for tail, it does not in fact appear to iterate across the entire file. It starts at the end, and reads backwards until it sees the correct number of newlines (plus any cruft from a nonterminated line), notes that location, skips to that location, and dumps the file (or piped or inputted data) thenceforth.

That's implementation-specific, and it happens only for seek()-able files. – Satō Katsura May 17 '17 at 16:34
That's only for numbering lines from the end like tail -n 7, not when numbering from the beginning like tail -n +7 to delete the first 6 lines where it wouldn't make sense to start from the end. With tail -n +7, tail reads from the start, but only prints what's after the 6th newline character. – Stéphane Chazelas May 18 '17 at 09:48
