This is similar to "Shuffle two parallel text files".
I have:
two large CSV files with parallel lines (they represent 'before' and 'after' states for particular items). The fields are sometimes strings, sometimes numbers.
a sufficiently long random data file to use with shuf
To get a matching random sample from both files, I thought of:
shuf -n10 --random-source="random.csv" "file1"
shuf -n10 --random-source="random.csv" "file2"
But the two resulting samples no longer match.
However, if I put line numbers in front, it solves the problem:
shuf -n10 --random-source="random.csv" <(cat -n "file1")
shuf -n10 --random-source="random.csv" <(cat -n "file2")
Can someone explain why?
Here is a sample of random.csv:
0.293076138
0.446732207
0.552989654
0.16141527
0.099383023
...
Here is a snippet from the two files:
VA,DEFAULT,72.8027,11.9534.....
VA,DEFAULT,61.8356,11.9342....
VA,DEFAULT,61.8356,....
Note that the first two fields are identical in most of the rows in both files. Maybe this is the issue? I don't know shuf well enough.
random.csv – iruvar Dec 05 '19 at 01:28
random.csv is not changing between the invocations of shuf? – Kusalananda Dec 05 '19 at 10:52
shuf (GNU coreutils) 8.26 generates different sequences for seekable vs. unseekable input when -n is lower than or equal to the number of lines. I.e. <file1 shuf -n10 --random-source=random.csv and cat file1 | shuf -n10 --random-source=random.csv are not equivalent. In addition, when I increase -n (starting from 1) with seekable input, I see the same sequence with an additional line each time. This seems sane. With unseekable input the sequences seem independent. Could it be that one of your actual commands used unseekable input, and you only simplified this to file2? – Kamil Maciorowski Dec 05 '19 at 22:26
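The seekable-versus-pipe difference described in the last comment can be checked locally. The following is only a minimal sketch, assuming GNU shuf is installed; demo.txt and random.bin are hypothetical names made up for the demonstration, and the result may depend on the coreutils version, so no expected output is shown.

```shell
#!/bin/sh
# Sketch: compare shuf's selection for seekable vs. unseekable input.
# demo.txt and random.bin are made-up names used only for this demo.
tmp=$(mktemp -d) || exit 1
cd "$tmp" || exit 1

seq 1 1000 > demo.txt                      # stand-in for file1
head -c 1000000 /dev/urandom > random.bin  # fixed random source reused by both runs

# Seekable input: shuf reads a regular file on stdin.
shuf -n10 --random-source=random.bin < demo.txt > seekable.out

# Unseekable input: shuf reads from a pipe.
cat demo.txt | shuf -n10 --random-source=random.bin > pipe.out

# Report whether both invocations picked the same lines in the same order.
if cmp -s seekable.out pipe.out; then
    echo "same selection"
else
    echo "different selection"
fi
```

If the two outputs differ on your system, feeding both files to shuf the same way (both as regular files, or both through a pipe or process substitution) removes that particular source of mismatch.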