If you don't mind reordering the lines and you have GNU coreutils (i.e. on non-embedded Linux or Cygwin, and not too ancient, since shuf appeared in version 6.0), shuf (“shuffle”) reorders the lines of a file randomly. So you can shuffle the file and dispatch the first m lines into one file and the rest into another.
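If a temporary file is acceptable, the dispatch itself is easy, because head and tail can each read the shuffled copy from the start. A minimal sketch (the name shuffled.tmp is just for illustration; as in the examples below, m holds the number of lines wanted in the first part and the input is in a file called input):

# Shuffle once into a temporary file, then cut it at line m.
# shuffled.tmp is an illustrative name; m, input, output1 and output2 are as elsewhere in this answer.
shuf <input >shuffled.tmp
head -n "$m" shuffled.tmp >output1
tail -n +"$((m+1))" shuffled.tmp >output2
rm shuffled.tmp

The cost is that the data is written to disk twice.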
If you want to avoid the temporary copy, there's no ideal way to do that dispatch on the fly. You can't just chain head and tail on the same pipe, because head would read ahead and consume lines that tail needs. You can use split, but you don't get any flexibility with respect to the output file names (see the sketch after the awk command below). You can use awk, of course:
<input shuf | awk -v m=$m '{ if (NR <= m) {print >"output1"} else {print >"output2"} }'
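For illustration, the split route could look something like this (a sketch; the part. prefix is arbitrary, and it assumes the input has more than m lines). Note that split also cuts the remainder into further chunks of m lines, which is part of why it is awkward here:

# Sketch of the split route (GNU coreutils). The first m shuffled lines end up in part.aa;
# the rest is spread over part.ab, part.ac, … and has to be stitched back together.
<input shuf | split -l "$m" - part.
mv part.aa output1
cat part.* >output2
rm part.*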
You can use sed, which is obscure but possibly faster for large files. (The first expression below writes lines 1 through m to output1; the second deletes them, so only the remaining lines reach standard output and land in output2.)
<input shuf | sed -e "1,${m} w output1" -e "1,${m} d" >output2
Or you can use tee to duplicate the data, if your platform has /dev/fd; that's ok if m is small. tee sends one copy of the shuffled data to head through its standard output and another copy to tail through file descriptor 3:
<input shuf | { tee /dev/fd/3 | head -n $m >output1; } 3>&1 | tail -n +$(($m+1)) >output2
Portably, you can use awk to dispatch each line in turn: each line goes to output1 with probability m/N, where m is the number of lines still needed for output1 and N the number of lines still left to read, which guarantees that exactly m lines end up in output1. Note that awk is not very good at initializing its random number generator: the randomness is definitely not suitable for cryptography, and not even very good for numerical simulations, since the seed will be the same for all awk invocations on any system within a one-second period.
<input awk -v N=$(wc -l <input) -v m=3 '
    BEGIN {srand()}
    {
        # send this line to output1 with probability m/N, where m lines are still
        # needed for output1 and N lines (including this one) remain to be read
        if (rand() * N < m) {--m; print >"output1"} else {print >"output2"}
        --N;
    }'
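You can see how weak the seeding is: two awk processes started within the same second typically print the same “random” number, because srand() with no argument seeds from the current time in whole seconds.

# Run these two commands within the same second and they will usually print the same value.
awk 'BEGIN {srand(); print rand()}'
awk 'BEGIN {srand(); print rand()}'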
If you need better randomness, you can do the same thing in Perl, which seeds its RNG decently.
<input perl -e '
    open OUT1, ">", "output1" or die $!;
    open OUT2, ">", "output2" or die $!;
    my $N = `wc -l <input`;   # total number of lines in the input
    my $m = $ARGV[0];         # number of lines to send to output1 (42 below)
    while (<STDIN>) {
        # same probability-m/N dispatch as the awk version
        if (rand($N) < $m) { --$m; print OUT1 $_; } else { print OUT2 $_; }
        --$N;
    }
    close OUT1 or die $!;
    close OUT2 or die $!;
' 42
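Whichever variant you use, a quick sanity check: the two output files together should have exactly as many lines as the input, and output1 should have exactly the requested number of lines (m, 3 or 42 depending on the example).

# The line counts of output1 and output2 should add up to that of input.
wc -l input output1 output2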