How to split a large file into two parts, at a pattern?

Given an example `file.txt`:

```
ABC
EFG
XYZ
HIJ
KNL
```

I want to split this file at `XYZ`, such that `file1` contains the lines up to `XYZ`, and the rest of the lines go in `file2`.
This is a job for `csplit`:

```shell
csplit -sf file -n 1 large_file /XYZ/
```

would silently split the file, creating pieces with prefix `file` and numbered using a single digit, e.g. `file0` etc. Note that using `/regex/` splits up to, but not including, the line that matches the regex. To split up to and including the matching line, add a `+1` offset:

```shell
csplit -sf file -n 1 large_file /XYZ/+1
```

This creates two files, `file0` and `file1`. If you absolutely need them to be named `file1` and `file2`, you could always add an empty pattern to the `csplit` command and remove the first file:

```shell
csplit -sf file -n 1 large_file // /XYZ/+1
```

creates `file0`, `file1` and `file2`, but `file0` is empty so you can safely remove it:

```shell
rm -f file0
```
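Putting it together on the sample `file.txt` from the question (a minimal sketch; the filenames follow the example above):

```shell
# Recreate the sample input from the question
printf '%s\n' ABC EFG XYZ HIJ KNL > file.txt

# Split after (and including) the XYZ line:
# -s = silent, -f file = output prefix, -n 1 = single-digit suffixes
csplit -sf file -n 1 file.txt '/XYZ/+1'

cat file0   # ABC, EFG, XYZ
cat file1   # HIJ, KNL
```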
With `awk` you can do:

```shell
awk '{print >out}; /XYZ/{out="file2"}' out=file1 largefile
```

Explanation: the first `awk` argument (`out=file1`) defines a variable with the filename that will be used for output while the subsequent argument (`largefile`) is processed. The `awk` program prints every line to the file named by the variable `out` (`{print >out}`). When the pattern `XYZ` is matched, the output variable is redefined to point to the new file (`{out="file2"}`), which then becomes the target for the subsequent lines.
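On the sample input, a quick check of this behavior (note that the `XYZ` line itself is printed before `out` is reassigned, so it lands in `file1`):

```shell
# Recreate the sample input
printf '%s\n' ABC EFG XYZ HIJ KNL > largefile

# print-first, then switch the output file on the match
awk '{print >out}; /XYZ/{out="file2"}' out=file1 largefile

cat file1   # ABC, EFG, XYZ
cat file2   # HIJ, KNL
```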
```shell
{ sed '/XYZ/q' >file1; cat >file2; } <infile
```

With GNU `sed` you should use the `-u` (unbuffered) switch. Most other `sed`s should just work, though.

To leave XYZ out...

```shell
{ sed -n '/XYZ/q;p'; cat >file2; } <infile >file1
```
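For illustration, here is the first variant run on the sample input with GNU `sed`, where `-u` is needed so that `sed` does not read past the `XYZ` line and starve the `cat` of the remaining data:

```shell
# Recreate the sample input
printf '%s\n' ABC EFG XYZ HIJ KNL > infile

# sed consumes (and prints) lines up to and including XYZ, then quits;
# cat inherits the same descriptor and copies the remainder
{ sed -u '/XYZ/q' >file1; cat >file2; } <infile

cat file1   # ABC, EFG, XYZ
cat file2   # HIJ, KNL
```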
With a modern `ksh` here's a shell variant (i.e. without `sed`) of one of the `sed`-based answers above:

```shell
{ read in <##XYZ ; print "$in" ; cat >file2 ;} <largefile >file1
```

And another variant in `ksh` alone (i.e. also omitting the `cat`):

```shell
{ read in <##XYZ ; print "$in" ; { read <##"" ;} >file2 ;} <largefile >file1
```

(The pure `ksh` solution seems to be quite performant; on a 2.4 GB test file it needed 19-21 s, compared to 39-47 s with the `sed`/`cat`-based approach.)
Try this with GNU `sed`:

```shell
sed -n -e '1,/XYZ/w file1' -e '/XYZ/,${/XYZ/d;w file2' -e '}' large_file
```
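A brief sketch of what this does on the sample input (GNU `sed` assumed; the `}` must sit in its own `-e` because `w` consumes the rest of the expression as the filename): the `1,/XYZ/` range writes everything through the match to `file1`, while in the `/XYZ/,$` range the `d` drops the match line itself so only the remainder reaches `file2`.

```shell
# Recreate the sample input
printf '%s\n' ABC EFG XYZ HIJ KNL > large_file

sed -n -e '1,/XYZ/w file1' -e '/XYZ/,${/XYZ/d;w file2' -e '}' large_file

cat file1   # ABC, EFG, XYZ
cat file2   # HIJ, KNL
```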
An easy hack is to print either to STDOUT or STDERR, depending on whether the target pattern has been matched. You can then use the shell's redirection operators to direct the output accordingly. For example, in Perl, assuming the input file is called `f` and the two output files `f1` and `f2`:

Discarding the line that matches the split pattern:

```shell
perl -ne 'if(/XYZ/){$a=1; next} ; $a==1 ? print STDERR : print STDOUT;' f >f1 2>f2
```

Including the matched line:

```shell
perl -ne '$a=1 if /XYZ/; $a==1 ? print STDERR : print STDOUT;' f >f1 2>f2
```

Alternatively, print to different file handles:

Discarding the line that matches the split pattern:

```shell
perl -ne 'BEGIN{open($fh1,">","f1");open($fh2,">","f2");}
if(/XYZ/){$a=1; next} $a==1 ? print $fh1 "$_" : print $fh2 "$_";' f
```

Including the matched line:

```shell
perl -ne 'BEGIN{open($fh1,">","f1"); open($fh2,">","f2");}
$a=1 if /XYZ/; $a==1 ? print $fh1 "$_" : print $fh2 "$_";' f
```
Comments:

- … `read` and `print` - you should just let it go to output all its own. The performance gets better if you build the AST toolkit wholly and get all of the `ksh` builtins compiled in - it's weird to me that `sed` isn't one of them, actually. But with stuff like `while <file do` I guess you don't need `sed` so much... – mikeserv May 10 '15 at 13:54
- … `awk` perform in your benchmark? And while I'm pretty sure `ksh` will likely always win this fight, if you're using a GNU `sed` you're not being very fair to `sed` - GNU's `-u` (unbuffered) is a piss-poor approach to POSIXly ensuring the descriptor's offset is left where the program quit it - there should be no need to slow down the regular operation of the program - buffering is fine - all `sed` should have to do is lseek the descriptor when finished. For whatever reason GNU reverses that mentality. – mikeserv May 10 '15 at 14:05
- … `while`; the printing is implicitly done as the defined side effect of the `<##` redirection operator. And only the matching line needs printing. (That way the shell feature implementation is most flexible for support of incl./excl.) An explicit `while` loop I'd expect to be significantly slower (but haven't checked). – Janis May 10 '15 at 14:07
- … `while` - that's a different syntax for something else (Korn calls it *file-scan* mode). The current line can always be printed with `head`, it's probably what I would use. Either way - `ksh` is always going to win this. – mikeserv May 10 '15 at 14:09
- … `sed` is a poor candidate for contention - but `ksh` would be the champion regardless. – mikeserv May 10 '15 at 14:13
- … `head` instead of the `read`; it seems only a little bit slower, but it's terser code: `{ head -1 <##XYZ ; { read <##"" ;} >file4 ;} <largefile >file3`. – Janis May 10 '15 at 14:18
- … `head` builtin? – mikeserv May 10 '15 at 14:18