17

How to split a large file into two parts, at a pattern?

Given an example file.txt:

ABC
EFG
XYZ
HIJ
KNL

I want to split this file at XYZ so that file1 contains the lines up to XYZ and file2 contains the rest.

cuonglm
  • 153,898
d.putto
  • 313

6 Answers

19

This is a job for csplit:

csplit -sf file -n 1 large_file /XYZ/

would silently split the file, creating pieces with prefix file and numbered using a single digit, e.g. file0 etc. Note that using /regex/ would split up to, but not including the line that matches regex. To split up to and including the line matching regex add a +1 offset:

csplit -sf file -n 1 large_file /XYZ/+1

This creates two files, file0 and file1. If you absolutely need them to be named file1 and file2 you could always add an empty pattern to the csplit command and remove the first file:

csplit -sf file -n 1 large_file // /XYZ/+1

creates file0, file1 and file2 but file0 is empty so you can safely remove it:

rm -f file0
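
For illustration, here is a minimal sketch of the `+1` form run against the five-line sample from the question. The scratch directory and the name large_file are just assumptions for the demo; the csplit options are the ones used above.

```shell
# Work in a scratch directory so no existing files get overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Recreate the sample input from the question.
printf '%s\n' ABC EFG XYZ HIJ KNL > large_file

# Split after the first line matching XYZ; the +1 offset keeps XYZ in the first piece.
csplit -sf file -n 1 large_file '/XYZ/+1'

# file0 should now hold ABC, EFG, XYZ and file1 should hold HIJ, KNL.
cat file0
cat file1
```

Dropping the `+1` would move the XYZ line to the start of file1 instead.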
don_crissti
  • 82,805
11

With awk you can do:

awk '{print >out}; /XYZ/{out="file2"}' out=file1 largefile


Explanation: The first awk argument (out=file1) defines a variable holding the filename that will be used for output while the subsequent argument (largefile) is processed. The awk program prints every line to the file named by the variable out ({print >out}). When the pattern XYZ is found, the variable is redefined to point to the new file ({out="file2"}), which then becomes the target for all subsequent lines.
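
As a rough sketch of how the rule order decides where the XYZ line lands, the snippet below runs the command above on the sample from the question, then a reordered variant. The variant and the output names b1 and b2 are my own assumptions for illustration, not part of the original command.

```shell
# Work in a scratch directory so no existing files get overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' ABC EFG XYZ HIJ KNL > largefile

# As above: print first, then switch the target, so XYZ is the last line of file1.
awk '{print >out}; /XYZ/{out="file2"}' out=file1 largefile

# Assumed variant: switch the target first, then print, so XYZ is the first line of b2.
awk '/XYZ/{out="b2"}; {print >out}' out=b1 largefile

# file1: ABC EFG XYZ   file2: HIJ KNL
# b1:    ABC EFG       b2:    XYZ HIJ KNL
```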

Janis
  • 14,222
6
{ sed '/XYZ/q' >file1; cat >file2; } <infile

With GNU sed you should use the --unbuffered (-u) switch, so that sed does not read past the line it quits on. Most other seds should just work though.

To leave XYZ out...

{ sed -n '/XYZ/q;p'; cat >file2; } <infile >file1
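
A small sketch of both forms on the sample from the question, assuming GNU sed (hence the -u switch). The output names head_part and rest_part in the second command are made up for the demo so the two runs do not overwrite each other.

```shell
# Work in a scratch directory so no existing files get overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' ABC EFG XYZ HIJ KNL > infile

# Keep XYZ in the first file: sed prints up to and including XYZ, then quits;
# cat inherits the same file descriptor and copies whatever sed left unread.
{ sed -u '/XYZ/q' >file1; cat >file2; } <infile

# Leave XYZ out: with -n nothing is printed automatically, and sed quits
# on XYZ before reaching the p command.
{ sed -u -n '/XYZ/q;p'; cat >rest_part; } <infile >head_part

# file1:     ABC EFG XYZ   file2:     HIJ KNL
# head_part: ABC EFG       rest_part: HIJ KNL
```

Without -u, GNU sed may buffer far more input than it processes, which can leave cat with little or nothing to read.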
mikeserv
  • 58,310
6

With a modern ksh here's a shell variant (i.e. without sed) of one of the sed based answers above:

{ read in <##XYZ ; print "$in" ; cat >file2 ;} <largefile >file1


And another variant in ksh alone (i.e. also omitting the cat):

{ read in <##XYZ ; print "$in" ; { read <##"" ;} >file2 ;} <largefile >file1


(The pure ksh solution seems to be quite fast; on a 2.4 GB test file it needed 19-21 sec, compared to 39-47 sec with the sed/cat based approach.)

Janis
  • 14,222
  • It's very fast. But I don't think you need to read and print - you should just let it go to output all on its own. The performance gets better if you build the AST toolkit wholly and get all of the ksh builtins compiled in - it's weird to me that sed isn't one of them, actually. But with stuff like while <file do I guess you don't need sed so much... – mikeserv May 10 '15 at 13:54
  • I am curious though - how did awk perform in your benchmark? And while I'm pretty sure ksh will likely always win this fight, if you're using a GNU sed you're not being very fair to sed - GNU's --unbuffered is a piss-poor approach to POSIXLY ensuring the descriptor's offset is left where the program quit it - there should be no need to slow down the regular operation of the program - buffering is fine - all sed should have to do is lseek the descriptor when finished. For whatever reason GNU reverses that mentality. – mikeserv May 10 '15 at 14:05
  • @mikeserv; The redirection pattern match runs until the pattern is found, and the line containing the pattern is not printed unless that is done explicitly, as shown. (At least that is what my test showed.) Note that there's no while; the printing is done implicitly as the defined side effect of the <## redirection operator. And only the matching line needs printing. (That way the shell feature is most flexible in supporting both inclusive and exclusive splits.) I'd expect an explicit while loop to be significantly slower (but haven't checked). – Janis May 10 '15 at 14:07
  • I know there's no while - that's a different syntax for something else (Korn calls it file-scan* mode). The current line can always be printed with head, it's probably what I would use. Either way - ksh is always* going to win this. – mikeserv May 10 '15 at 14:09
  • @mikeserv; (WRT your second comment): I didn't mean to be fair or unfair; I just took your sed code pattern and implemented it natively in ksh. The awk logic, while similar, is different, so I didn't use it as the base for the shell-only code. But I measured that as well, and (as expected) it's slower than sed; ~60 sec. – Janis May 10 '15 at 14:10
  • It's not really you - GNU holds the responsibility in that department. I upvoted this because it's an awesome answer - and that's true. I was just pointing out that GNU sed is a poor candidate for contention - but ksh would be the champion regardless. – mikeserv May 10 '15 at 14:13
  • 1
    @mikeserv; Ah, okay. BTW, I just tried the head instead of the read; it seems only a little bit slower, but it's terser code: { head -1 <##XYZ ; { read <##"" ;} >file4 ;} <largefile >file3. – Janis May 10 '15 at 14:18
  • Is your head builtin? – mikeserv May 10 '15 at 14:18
  • 1
    @mikeserv; Good point; it wasn't. But when I activate the builtin (just done and checked the results) it's the same numbers, strangely. (Maybe some function call overhead compared to read?) – Janis May 10 '15 at 14:23
3

Try this with GNU sed:

sed -n -e '1,/XYZ/w file1' -e '/XYZ/,${/XYZ/d;w file2' -e '}' large_file
Cyrus
  • 12,309
1

An easy hack is to print either to STDOUT or STDERR, depending on whether the target pattern has been matched. You can then use the shell's redirection operators to redirect the output accordingly. For example, in Perl, assuming the input file is called f and the two output files f1 and f2:

  1. Discarding the line that matches the split pattern:

    perl -ne 'if(/XYZ/){$a=1; next} ; $a==1 ? print STDERR : print STDOUT;' f >f1 2>f2
    
  2. Including the matched line:

    perl -ne '$a=1 if /XYZ/; $a==1 ? print STDERR : print STDOUT;' f >f1 2>f2
    

Alternatively, print to different file handles:

  1. Discarding the line that matches the split pattern:

    perl -ne 'BEGIN{open($fh1,">","f1");open($fh2,">","f2");}
    if(/XYZ/){$a=1; next} $a==1 ? print $fh2 "$_" : print $fh1 "$_";' f
    
  2. Including the matched line:

    perl -ne 'BEGIN{open($fh1,">","f1"); open($fh2,">","f2");}
              $a=1 if /XYZ/; $a==1 ? print $fh2 "$_" : print $fh1 "$_";' f
    
terdon
  • 242,166