I have a huge (70GB), single-line text file and I want to replace a string (token) in it.
I want to replace the token <unk> with another dummy token (GloVe issue).
I tried sed:
sed 's/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new
but the output file corpus.txt.new is zero bytes!
I also tried using perl:
perl -pe 's/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new
but I got an out of memory error.
For smaller files, both of the above commands work.
How can I replace a string in such a file? This is a related question, but none of the answers worked for me.
Edit:
What about splitting the file into chunks of 10GB (or whatever) each, applying sed to each one of them, and then merging them with cat? Does that make sense? Is there a more elegant solution?
s/<unk>/<raw_unk>/. If you could use rnk rather than raw_unk, you could edit the file in-place with a simple C program, which would avoid having to write out 70GB (of course, you would still need to read 70GB, and if the <unk> occurs frequently, you end up rewriting most of the file anyway). – Martin Bonner supports Monica Jan 03 '18 at 11:20
split with -b option defining chunk file sizes in bytes. Process each in turn using sed and then re-assemble. There is a risk that <unk> can be split in two files and won't be found... – Vladislavs Dovgalecs Jan 03 '18 at 23:44
<unk>? – LSpice Jan 04 '18 at 17:06