
What's the best way to take a segment out of a text file? and the many other questions in its right sidebar are almost duplicates of this one.

The one difference is that my file was too large to fit in the available RAM+VM, so anything I tried not only did nothing for minutes until killed, but also bogged down the system. One attempt left me unable to do anything until the system crashed.

I can write a loop in the shell or any other language to read, count, and discard lines until the wanted line number is reached, but maybe there already exists a single command that will do it?
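As a minimal sketch (hypothetical file name and line number), that loop would look something like this; the usual single-command equivalents are listed below it. All of them stream the file rather than loading it whole, though each still has to hold one complete line in memory at a time, so a single enormous line can still hurt:

    #!/bin/sh
    # Print line $n of $file by reading, counting, and discarding lines.
    n=60000          # wanted line number (example value)
    file=bigfile.txt # hypothetical file name

    i=0
    while IFS= read -r line; do
        i=$((i + 1))
        if [ "$i" -eq "$n" ]; then
            printf '%s\n' "$line"
            break
        fi
    done < "$file"

    # Single-command equivalents that also stop reading at line $n:
    #   sed -n "${n}{p;q}" "$file"
    #   awk -v n="$n" 'NR == n { print; exit }' "$file"
    #   tail -n +"$n" "$file" | head -n 1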

After trying a few things (vim, head -X | tail -1, a GUI editor), I gave up, deleted the file, and changed the program that created it to give me only the lines needed.

Looking further, Opening files of size larger than RAM, no swap used suggests that vi should handle it, but if vi behaves the same as vim here, it was definitely doing something that took minutes, not seconds.

WGroleau
  • The utility split comes to mind, e.g. split -n 4 file.txt should split file.txt into 4 parts. I don't really know split since I don't use it, but you can give it a try? – Jetchisel Jan 18 '20 at 01:21
  • I suspect it will try to read the whole file first, which would be impossible in my scenario. But if not, to get line N, split -l N-1 and then head -1 on the second file (see the sketch after these comments). Can't give it a try without re-creating the huge file. Rather not experiment, since one attempt already crashed the system. – WGroleau Jan 18 '20 at 02:09
  • 1
    sed, awk, etc will not load the whole file in the memory. But they would load whole lines in the memory, which may be exactly what brings your system to its knees -- because of overlong lines. What kind of file is that? Is it an xml by chance? –  Jan 18 '20 at 02:10
  • 1
    And no, don't even think about using text editors like vim with any huge files -- not only they will load it whole in the memory, but they will create a structure out of it, which will double / triple or worse its size ;-) –  Jan 18 '20 at 02:13
  • I don't believe there were any lines longer than 140 bytes. But head -60000 file | tail -1 filled three screens with a single line of garbage that I am pretty sure was NOT in the file. (And it did that after a minute or more of no visible output.) The file was created by looping through image file names and grepping for each one in two other files, so the lines were in the form filename: path/to/image. The reason it was so big is that one of the image file names was ".jpg", which matched everything in the other two files. – WGroleau Jan 18 '20 at 02:32
  • Your file was corrupted. Try filtering out the binary garbage from it: LC_COLLATE=C tr -sc ' -~\n' '\n' <infile >outfile. –  Jan 18 '20 at 02:45
  • The file (again) no longer exists. But if it was corrupted, then grep is the guilty party. If not, then head & tail failed due to size. – WGroleau Jan 18 '20 at 05:37
  • It wasn’t binary garbage. It was the first field of one line repeated a zillion times. – WGroleau Jan 18 '20 at 05:45
  • First step would be to run wc on such a file, and figure the average line length. wc does not buffer a whole lot. Even one huge line can show you a suspiciously large average. – Paul_Pedant Jan 18 '20 at 10:45
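For the split suggestion above, a hedged sketch (untested; the chunk names assume split's default two-letter suffixes, and $n and $file are hypothetical). split writes every chunk out to disk, so it needs enough free space for a full copy of the file, but it only ever buffers a small amount in memory:

    # Extract line $n of $file via split, as suggested in the comments.
    n=60000
    file=bigfile.txt

    # Each chunk gets n-1 lines, so the second chunk starts at line n.
    split -l "$((n - 1))" "$file" chunk_

    # With default suffixes the second chunk is chunk_ab.
    head -n 1 chunk_ab

    # Clean up the chunk copies afterwards.
    rm -f chunk_*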

1 Answer


You should try less.

From the man page:

Also, less does not have to read the entire input file before starting, so with large input files it starts up faster than text editors like vi(1).

I open large files regularly with less and have no problems with the start-up time. If you combine it with the -j n option, you can reach basically any part of the file in trivial time.
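For illustration (hypothetical file name and line number; less still has to scan up to the target line, but it does not load the whole file):

    # Open bigfile.txt positioned at line 60000, with the target line at the top of the screen.
    less -j1 +60000g bigfile.txt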

I didn't test this because I'd need a pretty big file, but combined with head you could also script it with more, which is better suited to scripting.

more +$START_LINE_NUMBER file | head -n $AMOUNT_OF_LINES

If I understood it right, head should terminate the pipeline after getting the required number of lines, so this should be the low-overhead solution you were looking for.
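A concrete (untested) example with hypothetical values; more still has to read and discard the lines before the start line, but it streams them rather than loading the file:

    # Print 1 line starting at line 60000 of bigfile.txt.
    more +60000 bigfile.txt | head -n 1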

  • 1
    Since one attempt crashed the machine, I’m not anxious to experiment. But if the situation arises again, I’ll remember this. A smaller-than-allowed edit: low-overhead, not zero. – WGroleau Jan 18 '20 at 17:33