I have two sentence-aligned parallel corpora (text files) with about 50 million words, taken from the Europarl corpus (parallel translations of legal documents). I'd now like to shuffle the lines of the two files, but both in the same way. I wanted to do that with gshuf (I'm on a Mac), using the same random source for both files.
gshuf --random-source /path/to/some/random/data file1
gshuf --random-source /path/to/some/random/data file2
But I got the error message "end of file", apparently because the random source needs to contain all the words that the file to be shuffled contains. Is that true? If yes, how should I create a random source that is good for my needs?
If no, in what other way could I randomize the files in parallel?
I thought about pasting them together, randomizing and then splitting again. However, this seems ugly since I would need to first find a delimiter that doesn't occur in the files.
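For reference, the paste/shuffle/split idea could look roughly like the sketch below. The demo files, the `workdir` name and the choice of `\002` as delimiter are assumptions made for illustration only; the `grep` guard is there because the whole approach breaks if the delimiter does occur in the data.

```shell
#!/bin/sh
# Sketch: shuffle two sentence-aligned files with one shared permutation.
# The demo files and the \002 delimiter are assumptions for illustration.
set -eu

# Use gshuf where it exists (macOS with GNU coreutils), otherwise shuf.
SHUF=$(command -v gshuf || command -v shuf)

workdir=$(mktemp -d)
printf 'en 1\nen 2\nen 3\nen 4\nen 5\n' > "$workdir/file1"
printf 'de 1\nde 2\nde 3\nde 4\nde 5\n' > "$workdir/file2"

# Control character assumed to be absent from both files.
delim=$(printf '\002')

# Guard: abort if the delimiter does occur in the input.
if grep -q "$delim" "$workdir/file1" "$workdir/file2"; then
    echo "delimiter occurs in input; pick another one" >&2
    exit 1
fi

# Join line by line, shuffle the joined lines once, then split again.
paste -d "$delim" "$workdir/file1" "$workdir/file2" | "$SHUF" > "$workdir/joined.shuf"
cut -d "$delim" -f 1 "$workdir/joined.shuf" > "$workdir/file1.shuf"
cut -d "$delim" -f 2 "$workdir/joined.shuf" > "$workdir/file2.shuf"

# Line N of one output still corresponds to line N of the other.
paste "$workdir/file1.shuf" "$workdir/file2.shuf"
```

Because only one shuffle happens, the two outputs cannot drift apart, whatever random source is used.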
See random sources. As to paste, you could use as delimiter some low ASCII char that's unlikely to occur in your files (like \x02, \x03...). – don_crissti Aug 05 '15 at 18:00
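On the random-source side: the file passed to --random-source does not need to contain the words of the input. shuf only reads random bytes from it, and "end of file" means the bytes ran out before the permutation was complete. A minimal sketch, assuming a saved chunk of /dev/urandom is acceptable as the shared source; the demo files, the `workdir` name and the 64 KiB size are assumptions chosen for this tiny example, and a multi-million-line corpus would need a much larger file:

```shell
#!/bin/sh
# Sketch: reuse one saved block of random bytes so that two runs of shuf
# apply the same permutation. File names and sizes are assumptions.
set -eu

# Use gshuf where it exists (macOS with GNU coreutils), otherwise shuf.
SHUF=$(command -v gshuf || command -v shuf)

workdir=$(mktemp -d)
printf 'en 1\nen 2\nen 3\nen 4\nen 5\n' > "$workdir/file1"
printf 'de 1\nde 2\nde 3\nde 4\nde 5\n' > "$workdir/file2"

# Save a fixed chunk of random bytes. If shuf still reports "end of file"
# on real data, the chunk is too small and should be enlarged.
head -c 65536 /dev/urandom > "$workdir/random.bin"

# Same random bytes + same number of lines => same permutation.
"$SHUF" --random-source="$workdir/random.bin" "$workdir/file1" > "$workdir/file1.shuf"
"$SHUF" --random-source="$workdir/random.bin" "$workdir/file2" > "$workdir/file2.shuf"

# The pairs should still line up after shuffling.
paste "$workdir/file1.shuf" "$workdir/file2.shuf"
```

This only keeps the files aligned when both have exactly the same number of lines, which holds for sentence-aligned corpora.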