I have a huge (70GB), one line, text file and I want to replace a string (token) in it.
I want to replace the token <unk> with another dummy token (glove issue).
I tried sed:
sed 's/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new
but the output file corpus.txt.new has zero bytes!
I also tried using perl:
perl -pe 's/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new
but I got an out of memory error.
For smaller files, both of the above commands work.
How can I replace a string in such a file? This is a related question, but none of the answers worked for me.
Edit:
What about splitting the file into chunks of 10GB (or whatever) each, applying sed to each one of them, and then merging them with cat? Does that make sense? Is there a more elegant solution?
s/<unk>/<raw_unk>/. If you could use rnk rather than raw_unk, you could edit the file in-place with a simple C program - which would avoid having to write out 70GB (of course, you would still need to read 70GB, and if the <unk> occurs frequently, you end up rewriting most of the file anyway). – Martin Bonner supports Monica Jan 03 '18 at 11:20
split with -b option defining chunk file sizes in bytes. Process each in turn using sed and then re-assemble. There is a risk that <unk> can be split in two files and won't be found... – Vladislavs Dovgalecs Jan 03 '18 at 23:44
<unk>? – LSpice Jan 04 '18 at 17:06
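The risk raised in the comments, a token cut in half at a chunk boundary, goes away if the file is broken up at whitespace instead of at fixed byte offsets. The sketch below assumes the corpus is a sequence of space-separated tokens (an assumption, not something stated in the question) and uses a hypothetical function name, `replace_unk_by_words`. It turns every space into a newline so that sed sees many short lines, then turns the newlines back into spaces.

```shell
# Hypothetical helper: replace <unk> with <raw_unk> by streaming the file
# one whitespace-separated token per line.
#   $1 = input file, $2 = output file
replace_unk_by_words() {
    # 1. tr: every space becomes a newline, so sed only ever holds one token.
    # 2. sed: the substitution from the question, unchanged.
    # 3. tr: every newline becomes a space again, restoring the one-line layout.
    tr ' ' '\n' < "$1" | sed 's/<unk>/<raw_unk>/g' | tr '\n' ' ' > "$2"
}

# Usage with the file names from the question:
#   replace_unk_by_words corpus.txt corpus.txt.new
```

Note that the last step also converts a trailing newline in the input, if there is one, into a trailing space.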