I have two sentence-aligned parallel corpora (text files) with about 50 million words, taken from the Europarl corpus (parallel translations of legal documents). I'd now like to shuffle the lines of the two files, but both in the same way. My plan was to use gshuf (I'm on a Mac) with one shared random source for both files:

gshuf --random-source /path/to/some/random/data file1
gshuf --random-source /path/to/some/random/data file2

But I got the error message "end of file", apparently because the random seed needs to contain all the words that the file to be shuffled contains. Is that true? If yes, how should I create a random seed that is good for my needs? If no, in what other way could I randomize the files in parallel? I thought about pasting them together, randomizing, and then splitting them again. However, this seems ugly, since I would first need to find a delimiter that doesn't occur in the files.

conipo
  • You got that error because your random_file doesn't contain enough bytes... See random sources. As for paste, you could use as the delimiter some low ASCII char that's unlikely to occur in your files (like \x02, \x03...). – don_crissti Aug 05 '15 at 18:00
  • Alright, whatever I want to randomize, if I use /dev/urandom, I should be good to go, right? The paste delimiter is a good tip, thanks! – conipo Aug 07 '15 at 08:09
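
The paste, shuffle, and split idea discussed above can be sketched roughly as follows. This is only an illustration: shuffle_pair is a hypothetical helper name, the sketch assumes that the byte \x02 occurs in neither file, and on a Mac shuf would be gshuf.

```shell
# shuffle_pair FILE1 FILE2 (hypothetical helper)
# Writes FILE1.shuf and FILE2.shuf with their lines permuted identically.
shuffle_pair() {
  # Assumption: the byte \x02 does not occur in either input file.
  delim=$(printf '\002')
  tmp=$(mktemp) || return 1
  # Glue line N of both files together, then shuffle the glued lines.
  paste -d "$delim" "$1" "$2" | shuf > "$tmp"
  # Split the glued lines back into their two columns.
  cut -d "$delim" -f1 "$tmp" > "$1.shuf"
  cut -d "$delim" -f2 "$tmp" > "$2.shuf"
  rm -f "$tmp"
}
```

Called as shuffle_pair file1 file2, this needs no shared random source, because both halves of each line travel through a single shuffle together.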

1 Answer

I do not know if there is a more elegant method but this works for me:

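# One named pipe per file to shuffle
mkfifo onerandom tworandom threerandom
# tee copies the same random byte stream into every pipe
tee onerandom tworandom threerandom < /dev/urandom > /dev/null &
shuf --random-source=onerandom onefile > onefile.shuf &
shuf --random-source=tworandom twofile > twofile.shuf &
shuf --random-source=threerandom threefile > threefile.shuf &
wait
# Remove the pipes once all shuffles have finished
rm onerandom tworandom threerandom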

Result:

$ head -n 3 *.shuf
==> onefile.shuf <==
24532 one
47259 one
58678 one

==> threefile.shuf <==
24532 three
47259 three
58678 three

==> twofile.shuf <==
24532 two
47259 two
58678 two

Note that the files must have exactly the same number of lines.
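
The line-count requirement can be checked before shuffling. A minimal sketch, where same_line_count is a hypothetical helper name and the file names in the usage comment are placeholders:

```shell
# same_line_count FILE1 FILE2 (hypothetical helper)
# Succeeds (exit status 0) only if both files have the same number of lines.
same_line_count() {
  [ "$(wc -l < "$1")" -eq "$(wc -l < "$2")" ]
}

# Usage:
#   same_line_count onefile twofile || echo "line counts differ" >&2
```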


The GNU Coreutils documentation also provides a nice solution for repeatable randomness, using openssl as a seeded random generator:

https://www.gnu.org/software/coreutils/manual/html_node/Random-sources.html#Random-sources

get_seeded_random()
{
  seed="$1"
  openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt \
    </dev/zero 2>/dev/null
}

shuf -i1-100 --random-source=<(get_seeded_random 42)

However, consider using a better seed than "42", unless you want anyone else to be able to reproduce "your" random result, too.

frostschutz
  • This works like a charm. Would you mind explaining the steps that you took? The tee command ensures that the same random number is stored in all three pipes, right? Why does it need to output to /dev/null as well? And is it automatically ensured that there are enough bytes and that the end of file error doesn't occur? – conipo Aug 06 '15 at 18:45
  • The /dev/null is because tee also prints to stdout. Could use > threerandom instead but it's harder to script. The named pipes will produce as much random data as is needed, so you don't have to know beforehand how much you'll need. – frostschutz Aug 06 '15 at 19:14
  • Ok, and why can't it then just be one pipe that you use as random source for all 3 shuffles one after another? – conipo Aug 06 '15 at 19:17
  • You can't read the same data three times from one pipe. You have to multiplex somehow and that's what tee does... – frostschutz Aug 06 '15 at 19:35