I have a file that contains 3494 lines, of which I would like to randomly select 100, and write those lines to a new file. I can do that using this:

shuf -n 100 input_file.txt > output_file.txt

However, I have many such input files, and I'd like to select the same 100 lines from each file. That is, I need to keep the line indices chosen by the first shuf and select those same lines in the other files. How can I do this?

EDIT:

The first answer was helpful, but I still have an issue selecting from the correct files: I have 10 files from which I would like to select the same 100 lines, yet I end up with 1100 lines somehow.
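(A likely cause, though it is only a guess and not confirmed in the thread: a glob such as ./*.txt matches every .txt file in the directory, including e.g. output_file.txt left over from the first shuf run, and 11 files × 100 lines = 1100. Counting what the glob matches is a quick check:)

# how many files does the glob actually expand to? 11 would explain 1100 lines
ls ./*.txt | wc -l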

3 Answers


You could first pick 100 random numbers from 1 to 3494 and then extract those line numbers from each file, e.g.

seq 3494 | shuf -n 100 | awk 'NR==FNR{ z[$0]++; next }
FNR in z { print > (FILENAME "_random") }' - ./*.txt

This will extract the same line numbers from each file and save them to a file named after the original with _random appended (e.g. file.txt_random).
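For readability, here is the same awk program laid out with comments (functionally identical; ./*.txt is just a placeholder for your input files):

seq 3494 | shuf -n 100 | awk '
    NR==FNR  { z[$0]++; next }                   # first input (the dash): remember the 100 random line numbers
    FNR in z { print > (FILENAME "_random") }    # subsequent files: print each line whose number was drawn
' - ./*.txt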

don_crissti
  • Where do I input the file name from which I would like to extract lines? Can you explain a bit more about what these commands are doing? – StatsSorceress Mar 02 '17 at 18:47
  • @StatsSorceress - you don't input any file name. You just extract 100 random numbers from 1 to 3494. Those numbers are then used to print the respective line numbers from each file. Let me know if it's clearer... Oh, I assume by "line indices" you mean line numbers, right? – don_crissti Mar 02 '17 at 18:50
  • Thanks @don_crissti, that makes more sense. I read a little more about awk, and that helped too. So is it fair to say that this is saying 'from a sequence of 1 to 3494, sample 100 numbers randomly without replacement, and if the selected number is equal to the line number we're currently at in the file, then take that line and export it to another file'? – StatsSorceress Mar 02 '17 at 18:53
  • @StatsSorceress - I would say that's a pretty accurate description. In my example I used ./*.txt as the file arguments to awk; change that to suit your setup/your filenames, but note that the dash - has to come before the file names, as it means "read the numbers from stdin", and only then is each file processed in turn. – don_crissti Mar 02 '17 at 18:54
  • Aha! Is this solution looking at every .txt file I have in this directory? – StatsSorceress Mar 02 '17 at 19:05
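To answer that last comment: yes, the shell expands ./*.txt to every .txt file in the current directory. To restrict the selection to particular files, name them explicitly instead of using the glob (a sketch with hypothetical filenames):

seq 3494 | shuf -n 100 | awk 'NR==FNR{ z[$0]++; next }
FNR in z { print > (FILENAME "_random") }' - file1.txt file2.txt file3.txt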

You could create a simple sed script file to print lines at specific indices, e.g.

printf '%dp\n' $(shuf -i 1-3494 -n 100) > indexfile
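The indexfile then contains one sed p (print) command per chosen line number, for example (the numbers are random, so yours will differ):

12p
847p
2046p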

then use it like

sed -nf indexfile File1
sed -nf indexfile File2
...

and so on. If you have GNU sed with the -s (--separate) option, you can select the same lines from multiple files sequentially using

sed -snf indexfile File1 File2 File3

(replace File1 File2 File3 with a shell glob if you wish).
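Note that this still sends all the selected lines to standard output, one file's worth after another. If, as in the question, you want a separate output file per input file, a simple shell loop does it (a sketch; the _random suffix is just an example):

for f in File1 File2 File3; do
    sed -nf indexfile "$f" > "${f}_random"
done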

If you want a one-liner that selects a different random subset each invocation, then you could do something like

printf '%dp\n' $(shuf -i 1-3494 -n 100) | sed -snf - File1 File2 File3
steeldriver
With perl you can pass the random index list in on the command line, either as a switch:

perl -ls0777ne 'print for(split $\)[split $\,$r]' -- -r="$(shuf -i 0-3493 -n 100)" -- ./*.txt

r="$(shuf -i 0-3493 -n 5)" \
perl -l -0777ne 'print for(split $\)[split $\,$ENV{r}]' ./*.txt

The random combination is generated once and passed to perl from the command line, ensuring that all files get the same random selection. Each file is slurped whole, then split on newline, and the chosen lines are picked out in one go via a list slice, i.e. the (LIST)[indices] construct. Note that since Perl's indices start from zero, shuf is given the range 0 to 3493.
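For reference, the same idea written out more verbosely (a sketch of the environment-variable variant; output should match the one-liners above):

r="$(shuf -i 0-3493 -n 100)" \
perl -0777 -ne '
    my @lines = split /\n/, $_;        # slurp one whole file, split it into lines
    my @idx   = split /\n/, $ENV{r};   # the shared random 0-based indices from shuf
    print "$_\n" for @lines[@idx];     # an array slice picks the same lines from every file
' ./*.txt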