14

I have a data list, like

12345
23456
67891
-20000
200
600
20
...

Assume the size of this data set (i.e., the number of lines in the file) is N. I want to randomly draw m lines from this data file. The output should therefore be two files: one containing the m drawn lines, and the other containing the remaining N-m lines.

Is there a way to do that using a Linux command?

sr_
    Are you concerned about the sequence of lines? eg. Do you want to maintain the source order, or do you want that sequence to be itself random as well as the choice of lines being random? – Peter.O Jan 22 '12 at 14:04

5 Answers

18

This might not be the most efficient way but it works:

shuf <file> > tmp
head -n $m tmp > out1
tail -n +$(( m + 1 )) tmp > out2

where $m contains the number of lines to draw.
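As a sketch of the whole round trip, using hypothetical file names (data.txt, out1, out2 are placeholders) and the sample data from the question with m=3:

```shell
# Hypothetical demo: draw m=3 random lines from data.txt
# (all file names here are placeholders)
printf '%s\n' 12345 23456 67891 -20000 200 600 20 > data.txt
m=3
shuf data.txt > tmp
head -n "$m" tmp > out1              # the m drawn lines
tail -n +"$((m + 1))" tmp > out2     # the remaining N-m lines
wc -l < out1    # prints 3
wc -l < out2    # prints 4
```

Together, out1 and out2 contain exactly the original lines, just partitioned at random.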

5

This bash/awk script chooses lines at random, and maintains the original sequence in both output files.

awk -v m=4 -v N=$(wc -l <file) -v out1=/tmp/out1 -v out2=/tmp/out2 \
 'BEGIN{ srand()
         do{ lnb = 1 + int(rand()*N)
             if ( !(lnb in R) ) {
                 R[lnb] = 1
                 ct++ }
         } while (ct<m)
  } { if (R[NR]==1) print > out1 
      else          print > out2       
  }' file
cat /tmp/out1
echo ========
cat /tmp/out2

Output, based on the data in the question.

12345
23456
200
600
========
67891
-20000
20
Peter.O
5

As with all things Unix, There's a Utility for That™.

Program of the day: split
split will split a file in many different ways: -b bytes, -l lines, -n number of output files. We will be using the -l option. Since you want to pick random lines and not just the first m, we'll sort the file randomly first. If you want to read about sort, refer to my answer here.

Now, the actual code. It's quite simple, really:

sort -R input_file | split -l $m output_prefix

This will make two files, one with m lines and one with N-m lines, named output_prefixaa and output_prefixab. Make sure $m is the larger of the two sizes you want, or you'll get several files of length m (and one with N % m lines).

If you want to ensure that you use the correct size, here's a little code to do that:

m=10 # size you want one file to be
N=$(wc -l < input_file)
m=$(( m > N/2 ? m : N - m ))
sort -R input_file | split -l $m output_prefix

Edit: It has come to my attention that some sort implementations don't have a -R flag. If you have perl, you can substitute perl -e 'use List::Util qw/shuffle/; print shuffle <>;'.

Kevin
    Unfortunately, sort -R appears to only be in some versions of sort (probably the gnu version). For other platforms I wrote a tool called 'randline' which does nothing but randomize stdin. It's at http://beesbuzz.biz/code/ for anyone who needs it. (I tend to shuffle file contents quite a lot.) – fluffy Jan 22 '12 at 18:49
    Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you can choose the output file names by using head and tail instead of split. – Gilles 'SO- stop being evil' Jan 23 '12 at 00:48
4

If you don't mind reordering the lines and you have GNU coreutils (i.e. on non-embedded Linux or Cygwin, not too ancient since shuf appeared in version 6.0), shuf (“shuffle”) reorders the lines of a file randomly. So you can shuffle the file and dispatch the first m lines into one file and the rest into another.

There's no ideal way to do that dispatch. You can't just chain head and tail because head would read ahead. You can use split, but you don't get any flexibility with respect to the output file names. You can use awk, of course:

<input shuf | awk -v m=$m '{ if (NR <= m) {print >"output1"} else {print} }'

You can use sed, which is obscure but possibly faster for large files.

<input shuf | sed -e "1,${m} w output1" -e "1,${m} d" >output2

Or you can use tee to duplicate the data, if your platform has /dev/fd; that's ok if m is small:

<input shuf | { tee /dev/fd/3 | head -n $m >output1; } 3>&1 | tail -n +$(($m+1)) >output2

Portably, you can use awk to dispatch each line in turn. Note that awk is not very good at initializing its random number generator; the randomness is definitely not suitable for cryptography, and not even very good for numerical simulations. The seed will be the same for all awk invocations on any system within a one-second period.

<input awk -v N=$(wc -l <input) -v m=3 '
    BEGIN {srand()}
    {
        if (rand() * N < m) {--m; print >"output1"} else {print >"output2"}
        --N;
    }'

If you need better randomness, you can do the same thing in Perl, which seeds its RNG decently.

<input perl -e '
    open OUT1, ">", "output1" or die $!;
    open OUT2, ">", "output2" or die $!;
    my $N = `wc -l <input`;
    my $m = $ARGV[0];
    while (<STDIN>) {
        if (rand($N) < $m) { --$m; print OUT1 $_; } else { print OUT2 $_; }
        --$N;
    }
    close OUT1 or die $!;
    close OUT2 or die $!;
' 42
  • @Gilles: For the awk example: -v N=$(wc -l <file) -v m=4 ... and it only prints a "random" line when the random value is less than $m, rather than printing $m random lines... It seems that perl may be doing the same thing with rand, but I don't know perl well enough to get past a compilation error: syntax error at -e line 7, near ") print" – Peter.O Jan 23 '12 at 04:49
  • @Peter.O Thanks, that's what comes from typing in a browser and carelessly editing. I've fixed the awk and perl code. – Gilles 'SO- stop being evil' Jan 23 '12 at 10:12
  • All 3 methods working well and fast.. thanks (+1) ... I'm slowly getting my head around perl... and that's a particularly interesting and useful file split in the shuf example. – Peter.O Jan 23 '12 at 23:03
A buffering problem? Am I missing something? The head/cat combo causes loss of data in the following second test 3-4 .... TEST 1-2 { for i in {00001..10000} ;do echo $i; done; } | { head -n 5000 >out1; cat >out2; } .. TEST 3-4 { for i in {00001..10000} ;do echo $i; done; } >input; cat input | { head -n 5000 >out3; cat >out4; } ... wc -l results for the outputs of TEST 1-2 are 5000 5000 (good), but for TEST 3-4 are 5000 4539 (not good).. The difference varies depending on the file sizes involved... Here is a link to my test code – Peter.O Jan 24 '12 at 04:00
  • @Peter.O Right again, thanks. Indeed, head reads ahead; what it reads ahead and doesn't print out is discarded. I've updated my answer with less elegant but (I'm reasonably sure) correct solutions. – Gilles 'SO- stop being evil' Jan 24 '12 at 15:10
2

Assuming m = 7 and N = 21:

cp ints ints.bak
for i in {1..7}
do
    rnd=$((RANDOM%(21-i)+1))
    # echo $rnd
    sed -n "${rnd}{p;q}" ints >> mlines
    sed -i "${rnd}d" ints
done

Note: If you replace 7 with a variable like $1 or $m, you have to use seq, not the {from..to}-notation, which doesn't do variable expansion.
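A minimal sketch of that substitution (the variable name m is arbitrary):

```shell
# {1..$m} does NOT expand the variable; seq does the job instead
m=7
count=0
for i in $(seq 1 "$m"); do
    count=$((count + 1))
done
echo "$count"    # prints 7
```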

It works by deleting lines one at a time from the file, which gets shorter and shorter, so the range of line numbers that can be drawn has to shrink accordingly.

This should not be used for long files or large m, since for every drawn number sed reads, on average, half the file for the first command and the whole file for the second.

user unknown
  • He needs a file with the lines that are removed too. – Rob Wouters Jan 22 '12 at 14:36
I thought "including these m lines of data" meant including them along with the original lines as well - therefore including, not consisting of - but I guess your interpretation is what user288609 meant. I will adjust my script accordingly. – user unknown Jan 22 '12 at 14:39
Looks good. – Rob Wouters Jan 22 '12 at 14:52
  • @user unknown: You have the +1 in the wrong place. It should be rnd=$((RANDOM%(N-i)+1)) where N=21 in your example.. It currently causes sed to crash when rnd is evaluated to 0. .. Also, it doesn't scale very well with all that file re-writing. eg 123 seconds to extract 5,000 random lines from a 10,000 line file vs. 0.03 seconds for a more direct method... – Peter.O Jan 23 '12 at 12:04
  • @Peter.O: You're right (corrected) and you're right. – user unknown Jan 23 '12 at 12:38