32

I am writing a PHP script to parse a large text file to do database inserts from it. However on my host, the file is too large, and I hit the memory limit for PHP.

The file has about 16,000 lines; I want to split it up into four separate files (at first) to see if I can load those.

The first part I can get with head -4000 file.txt. The middle sections are slightly trickier -- I was thinking about piping tail output into head ( tail -4001 file.txt | head -4000 > section2.txt ), but is there another/better way?

Actually my logic is messed up -- for section two, I would need to do something like tail -12001 file.txt | head -4000, and then lower the tail argument for the next sections. I'm getting mixed up already! :P

user394

3 Answers

43

If you want to avoid getting mixed up but still use tail and head, tail can count lines from the beginning of the file instead of the end:

tail -n +4001 yourfile | head -4000
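To make the arithmetic explicit, here is a minimal sketch that wraps the same tail/head pipeline in a function. The helper name section and its arguments are made up for illustration, and the demo uses a small generated file instead of the real 16,000-line one:

```shell
# Hypothetical helper: print section number $2 (1-based) of file $1,
# where every section is $3 lines long (default 4000).
section() {
    file=$1
    index=$2
    size=${3:-4000}
    first=$(( (index - 1) * size + 1 ))   # first line of the requested section
    tail -n +"$first" "$file" | head -n "$size"
}

# Demo: a 10-line file cut into sections of 4 lines.
demo=$(mktemp)
seq 1 10 > "$demo"
section "$demo" 2 4    # prints lines 5 to 8
section "$demo" 3 4    # prints lines 9 and 10 (the last section may be short)
rm -f "$demo"
```

With the defaults, section file.txt 2 > section2.txt would be equivalent to the one-liner above.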

... But a better, automatic tool made just for splitting files is called... split! It's also a part of GNU coreutils, so any normal Linux system should have it. Here's how you can use it:

split -l 4000 yourInputFile thePrefixForOutputFiles

(See man split if in doubt.)
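As a rough illustration of what split produces, the sketch below runs it on a small generated file in a scratch directory. The names input.txt and section_ are placeholders; by default split appends the suffixes aa, ab, ac, and so on to the prefix:

```shell
# Scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
seq 1 10 > "$workdir/input.txt"

# 10 lines in chunks of 4 lines: section_aa (4), section_ab (4), section_ac (2).
split -l 4 "$workdir/input.txt" "$workdir/section_"

# Show the line count of each piece.
wc -l "$workdir"/section_*

# Concatenating the pieces in name order reproduces the original file.
cat "$workdir"/section_* | cmp - "$workdir/input.txt" && echo "pieces match the original"
```

For the 16,000-line file in the question, split -l 4000 would give four pieces in the same way.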

  • Given that split will split the file into three chunks, only one of which one cares about, surely the sed answer is preferable. – cbmanica Oct 01 '21 at 16:17
30

Combining head and tail as you did will work, but for this I would use sed:

sed -n '1,4000p' input_file # print lines 1-4000 of input_file

This lets you solve your problem with a quick shell function:

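chunk_it(){
    step=4000    # lines per chunk
    start=1
    end=$step
    for n in {1..4} ; do    # four chunks in total
        # write lines start..end to a file named after the range
        sed -n "${start},${end}p" "$1" > "$1".$start-$end
        let start+=$step
        let end+=$step
    done
}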

chunk_it your_file

Now you have your_file.1-4000 and your_file.4001-8000 and so on.

Note: requires bash
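One possible refinement, sketched here as an assumption rather than as part of the answer above: with only the p command, sed keeps reading to the end of the file after the last wanted line. Adding a q command at the end line makes it stop there, which can save time on large inputs. The helper name lines is hypothetical, and this version needs only a POSIX shell:

```shell
# Hypothetical helper: print lines $2 through $3 of file $1,
# then quit instead of scanning the rest of the file.
lines() {
    sed -n -e "${2},${3}p" -e "${3}q" "$1"
}

# Demo on a small generated file.
demo=$(mktemp)
seq 1 10 > "$demo"
lines "$demo" 5 8    # prints lines 5 to 8
rm -f "$demo"
```

Inside chunk_it, the sed call could be swapped for lines "$1" "$start" "$end" without changing the output files.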

Sorpigal
  • I like the sed way. – fanchyna Feb 20 '16 at 15:38
  • This doesn't work for me because sed doesn't exit. It prints out the lines I want to stdout, but I have to ctrl-c out, and as a result, I can't redirect it to a file. Any suggestion to make it usable? – Brent212 Jun 30 '17 at 18:41
  • Figured it out! "sed -n '<start_line>,<end_line>w <output_file>' <input_file>" works for me. – Brent212 Jun 30 '17 at 18:54
  • @Brent212 Another option to note is that you can also pipe it into less or redirect the output to a file. – Kyle s Dec 19 '18 at 19:54
  • On GNU sed, redirecting can work. The problem is that sed buffers its output by default. Use -u to disable buffering and redirecting should work. Note that the -u option is not available on all versions of sed. – rinogo Jan 18 '22 at 16:15
0

You can also use bat, like this:

bat -r 4001:8000 input-file.txt >output-file-1.txt

The benefit is that you can omit the output redirection to preview what would be written.

Note: It’s probably overkill to install bat just for this feature; this answer is useful if you already have it installed for some reason.

Franklin Yu