
This is similar to the question "Shuffle two parallel text files".

I have:

  • two large CSV files with parallel lines (they represent 'before' and 'after' states for particular items). The fields are sometimes strings, sometimes numbers.

  • a sufficiently long random data file to use with shuf

To get a matching random sample from both files, I thought of:

shuf -n10 --random-source="random.csv" "file1" 
shuf -n10 --random-source="random.csv" "file2" 

but the two samples no longer match line for line.

However, if I put line numbers in front of each line, the problem goes away:

shuf -n10 --random-source="random.csv" <(cat -n "file1") 
shuf -n10 --random-source="random.csv" <(cat -n "file2")

Can someone explain why?

Here is a sample of random.csv:

0.293076138
0.446732207
0.552989654
0.16141527
0.099383023
...

Here is a snippet from the two files:

VA,DEFAULT,72.8027,11.9534.....
VA,DEFAULT,61.8356,11.9342....
VA,DEFAULT,61.8356,....

Note that the first two fields are identical in most of the rows in both files. Maybe this is the issue? I don't know shuf well enough.
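
A possible workaround that does not depend on how shuf consumes the random source for each file: pick the line numbers once, then extract those same lines from both files. This is only a minimal sketch, assuming GNU shuf and a POSIX awk; the function name sample_parallel and the output file names are hypothetical, and it does not explain the behaviour described above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: draw the same N random rows from two parallel files.
# Usage: sample_parallel FILE1 FILE2 N OUT1 OUT2
sample_parallel() {
    local f1=$1 f2=$2 n=$3 out1=$4 out2=$5
    local picks
    picks=$(mktemp) || return 1

    # Choose N distinct line numbers once, from 1..(number of lines in FILE1).
    shuf -n "$n" -i "1-$(wc -l < "$f1")" > "$picks"

    # Print the chosen lines from each file, in original file order.
    # While reading the picks file (NR == FNR) remember the numbers,
    # then print every line of the data file whose number was picked.
    awk 'NR == FNR { want[$1]; next } FNR in want' "$picks" "$f1" > "$out1"
    awk 'NR == FNR { want[$1]; next } FNR in want' "$picks" "$f2" > "$out2"

    rm -f "$picks"
}
```

Calling sample_parallel file1 file2 10 sample1.csv sample2.csv would then write ten rows to each output file, taken from the same positions in both inputs. Both files are assumed to have the same number of lines, and N is assumed to be at least 1.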

Tim
  • can't reproduce. Please paste a snippet of random.csv – iruvar Dec 05 '19 at 01:28
  • I also can not reproduce this behaviour. Can we assume that random.csv is not changing between the invocations of shuf? – Kusalananda Dec 05 '19 at 10:52
  • @Kusalananda certainly. it is simply a list of random numbers that I saved as a .csv file. It remains fixed. – Tim Dec 05 '19 at 21:29
  • Note: on my Debian 9, shuf (GNU coreutils) 8.26 generates different sequences for seekable vs. unseekable input when -n is lower than or equal to the number of lines. I.e. <file1 shuf -n10 --random-source=random.csv and cat file1 | shuf -n10 --random-source=random.csv are not equivalent. In addition, when I increase -n (starting from 1) with seekable input, I see the same sequence with an additional line each time. This seems sane. With unseekable input the sequences seem independent. Could it be that one of your actual commands used unseekable input, and you only simplified this to file2? – Kamil Maciorowski Dec 05 '19 at 22:26
  • a good suggestion, but no, all files are plain csv files, no pipes, no redirections. – Tim Dec 12 '19 at 03:29
