Replace string in a huge (70GB), one line, text file

Question

I have a huge (70GB), one line, text file and I want to replace a string (token) in it. I want to replace the token <unk>, with another dummy token (glove issue).

I tried sed:

sed 's/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new

but the output file corpus.txt.new has zero-bytes!

I also tried using perl:

perl -pe 's/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new

but I got an out of memory error.

For smaller files, both of the above commands work.

How can I replace a string in such a file? This is a related question, but none of the answers worked for me.

Edit: What about splitting the file in chunks of 10GBs (or whatever) each and applying sed on each one of them and then merging them with cat? Does that make sense? Is there a more elegant solution?

as @Gilles noted, can you detect some repeated character that could serve as a custom delimiter in your single big line? — RomanPerekhrest, Dec 29 '17 at 15:35
I am thinking that a tool that can only do search and replace, but not any more complex regex, would be faster. It would also not benefit from doing a line at a time, so would not choke on this file. Unfortunately I have no idea of the existence of such a tool, though it would not be hard to write. If it is a one off then substituting in newline characters as in one of the answers would probably be easiest. — ctrl-alt-delor, Dec 29 '17 at 15:43
Does your file contain anything other than ASCII? If so, all the unicode handling could be omitted and raw bytes could be processed. — Patrick Bucher, Dec 29 '17 at 19:54
I agree with @PatrickButcher Look at a bigger picture. Besides the immediate need to replace this text, what else is this file supposed to be used for? If it is a log of some sort, no one is going to be able to work with it effectively. If it is a data file that some app uses, then that app should hold the responsibility for maintaining the data in that file. — Thomas Carlisle, Dec 30 '17 at 13:47
@ThomasCarlisle: The huge file is a GloVe corpus, but I'm still not 100% sure what GloVe actually does. I think it might be useful for natural language processing: teaching computers to analyze plain-English text and to understand what the text really means. GloVe is useful for training and using word embeddings. See also a related Quora post and a relevant blog post. — unforgettableidSupportsMonica, Jan 03 '18 at 09:50
Your example replacement is s/<unk>/<raw_unk>/. If you could use rnk rather than raw_unk, you could edit the file in-place with a simple C program, which would avoid having to write out 70GB (of course, you would still need to read 70GB, and if the <unk> occurs frequently, you end up rewriting most of the file anyway). — Martin Bonner supports Monica, Jan 03 '18 at 11:20
In case you have easy access to a Mac system, Hex Fiend is designed for editing huge files. — LSpice, Jan 03 '18 at 21:17
You can use split with the -b option to define the chunk file size in bytes. Process each chunk in turn using sed and then re-assemble. There is a risk that <unk> gets split across two files and won't be found... — Vladislavs Dovgalecs, Jan 03 '18 at 23:44
@VladislavsDovgalecs, good idea! For the problem you mention, maybe a sliding window will address it (for example, checking files split on 0-, 1-, 2-, 3-, and 4-aligned boundaries quintuples the work, but prevents a possible miss); or else, more economically, just giving special processing to any chunk (and its successor) that ends with any initial substring of <unk>? — LSpice, Jan 04 '18 at 17:06
Is this an XML file? In that case, there are XML parsers that can do manipulations on XML streams... — Kusalananda, Mar 16 '18 at 09:32

score 116 · Answer 1 · edited Jan 01 '18 at 18:07

For such a big file, one possibility is Flex. Let unk.l be:

%%
\<unk\>     printf("<raw_unk>");  
%%

Then compile and execute:

$ flex -o unk.c  unk.l
$ cc -o unk -O2 unk.c -lfl
$ ./unk < corpus.txt > corpus.txt.new

Zanna

3,571

answered Dec 29 '17 at 16:40

JJoao

5

make has default rules for this, instead of the flex/cc you can add an %option main as the first line of unk.l and then just make unk. I more-or-less reflexively use %option main 8bit fast, and have export CFLAGS='-march=native -pipe -Os' in my .bashrc. – jthill Dec 30 '17 at 17:16
1

@undercat: If it weren't off-topic, I could show you a number of non-compiler front end applications, from solving the water-level problem to special-purpose input parsing. It's amazing what you can do with it, if you think outside the box a bit :-) – jamesqf Dec 31 '17 at 04:50
@jthill, thank you: %option main + make + optionally CFLAGS are a very nice trick!! Is -march=native the default behaviour? – JJoao Jan 03 '18 at 16:49
@jamesqf, I feel the same and I am curious about your point of view and message: why not writing a new question to specifically raise that topic? – JJoao Jan 03 '18 at 16:56
@JJoao: I'm not sure how such a question could be worded. "What interesting non compiler frontend things can you do with flex?" perhaps. – jamesqf Jan 03 '18 at 20:29
1

@jamesqf as you said - will be hard to make that an on topic question - but I would like to see it also – Zombo Jan 04 '18 at 00:57
1

@jamesqf A prof of mine at uni used flex to build a tool that recognised fabric types for a factory! How about asking something like: "flex seems like a very powerful tool but I'm unlikely to be writing any compilers/parsers - are there any other use cases for flex?" – Paul Evans Jan 04 '18 at 20:37

Gilles 'SO- stop being evil' · Accepted Answer · 2017-12-30T11:30:14.023

111

The usual text processing tools are not designed to handle lines that don't fit in RAM. They tend to work by reading one record (one line), manipulating it, and outputting the result, then proceeding to the next record (line).

If there's an ASCII character that appears frequently in the file and doesn't appear in <unk> or <raw_unk>, then you can use that as the record separator. Since most tools don't allow custom record separators, swap between that character and newlines. tr processes bytes, not lines, so it doesn't care about any record size. Supposing that ; works:

<corpus.txt tr '\n;' ';\n' |
sed 's/<unk>/<raw_unk>/g' |
tr '\n;' ';\n' >corpus.txt.new

You could also anchor on the first character of the text you're searching for, assuming that it isn't repeated in the search text and it appears frequently enough. If the file may start with unk>, change the sed command to sed '2,$ s/… to avoid a spurious match.

<corpus.txt tr '\n<' '<\n' |
sed 's/^unk>/raw_unk>/g' |
tr '\n<' '<\n' >corpus.txt.new

Alternatively, use the last character.

<corpus.txt tr '\n>' '>\n' |
sed 's/<unk$/<raw_unk/g' |
tr '\n>' '>\n' >corpus.txt.new

Note that this technique assumes that sed operates seamlessly on a file that doesn't end with a newline, i.e. that it processes the last partial line without truncating it and without appending a final newline. It works with GNU sed. If you can pick the last character of the file as the record separator, you'll avoid any portability trouble.
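For readers who would rather use one script than a tr/sed/tr pipeline, the same record-separator idea can be sketched in Python. This is only an illustration under the same assumption as above (the separator byte occurs often and does not appear in the search string); the function name replace_by_records, the default ; separator and the chunk size are my own choices, not part of any tool.

```python
def replace_by_records(src, dst, search, replace, sep=b";", chunk_size=1 << 20):
    """Copy src to dst, replacing search with replace record by record.

    Assumes sep does not occur inside search, so a match can never span
    two records. Memory use stays small as long as sep occurs regularly.
    """
    if sep in search:
        raise ValueError("separator must not occur in the search string")
    tail = b""  # incomplete record carried over from the previous read
    while True:
        block = src.read(chunk_size)
        if not block:
            break
        # Everything up to and including the last separator is complete.
        head, found, tail = (tail + block).rpartition(sep)
        dst.write((head + found).replace(search, replace))
    # Whatever is left is the final record (it has no trailing separator).
    dst.write(tail.replace(search, replace))
```

It could be called with binary file objects, for example open("corpus.txt", "rb") and open("corpus.txt.new", "wb"). Unlike the pipeline, it does not care whether the file ends with a newline.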

edited Dec 30 '17 at 11:30

answered Dec 29 '17 at 15:07

Gilles 'SO- stop being evil'

829,060

9

I don't have such a file to test with, but in Awk you can specify the "Record Separator" and the "Output Record Separator". So assuming you have a decent smattering of commas in your file, it's possible you could solve this with: awk -v RS=, -v ORS=, '{gsub(/<unk>/, "<raw_unk>"); print}' No? – Wildcard Dec 30 '17 at 07:33
4

@Wildcard Yes, that's another solution. Awk tends to be slower than sed though, that's why I don't offer it as the preferred solution for a huge file. – Gilles 'SO- stop being evil' Dec 30 '17 at 11:20
You can set the record separator in Perl with command line option -0 and the octal value of a char, or inside the script it can be set with special variable $/ – beasy Dec 31 '17 at 22:27
@Gilles : But using awk avoid passing the stream twice to tr. So would it be still slower ? – user285259 Jan 01 '18 at 22:20
2

@user285259 Typically not. tr is very fast and the pipe can even be parallelized. – Gilles 'SO- stop being evil' Jan 02 '18 at 07:24
i've never used tr before and only a little sed. if the gawk solution is a little more easy to read/understand, maybe post the gawk solution in the comments. – Trevor Boyd Smith Jan 05 '18 at 15:43
Sorry, but why not just using tr directly to solve the problem? I mean what's wrong with tr "<unk>" "<raw_unk>"? – jena Mar 29 '21 at 08:33
@jena tr changes single characters. This command changes all u to r, all n to a, all k to w and all > to _. It's impossible to replace a specific multi-character string with tr. – Gilles 'SO- stop being evil' Mar 29 '21 at 09:29
Oh I see, I forgot about that, thanks. – jena Mar 30 '21 at 14:47

sourcejedi · Answer 3 · 2018-01-01T18:32:00.560

41

So you don't have enough physical memory (RAM) to hold the whole file at once, but on a 64-bit system you have enough virtual address space to map the entire file. Virtual mappings can be useful as a simple hack in cases like this.

The necessary operations are all included in Python. There are several annoying subtleties, but it does avoid having to write C code. In particular, care is needed to avoid copying the file in memory, which would defeat the point entirely. On the plus side, you get error-reporting for free (python "exceptions") :).

#!/usr/bin/python3
# This script takes input from stdin
# (but it must be a regular file, to support mapping it),
# and writes the result to stdout.

search = b'<unk>'
replace = b'<raw_unk>'


import sys
import os
import mmap

# sys.stdout requires str, but we want to write bytes
out_bytes = sys.stdout.buffer

mem = mmap.mmap(sys.stdin.fileno(), 0, access=mmap.ACCESS_READ)
i = mem.find(search)
if i < 0:
    sys.exit("Search string not found")

# mmap object subscripts to bytes (making a copy)
# memoryview object subscripts to a memoryview object
# (it implements the buffer protocol).
view = memoryview(mem)

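# Copy the data between matches, replacing every occurrence.
pos = 0
while i >= 0:
    out_bytes.write(view[pos:i])
    out_bytes.write(replace)
    pos = i + len(search)
    i = mem.find(search, pos)
out_bytes.write(view[pos:])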

edited Jan 01 '18 at 18:32

answered Dec 29 '17 at 21:44

sourcejedi

50,249

If my system has about 4 GB of consecutive memory free out of the 8 GB, does mem = mmap.mmap(sys.stdin.fileno(), 0, access=mmap.ACCESS_READ) mean that it places the data in that space? Or would it be much lower (1 GB?) – Rahul Dec 31 '17 at 09:04
1

@Rahul "So you don't have enough RAM, but on a 64-bit system you have enough virtual address space to map the entire file." It's paged in and out of physical RAM on demand (or lack thereof). This program should work without requiring any large amount of physical RAM. 64-bit systems have much more virtual address space than the maximum physical RAM. Also each running process has its own virtual address space. This means the system as a whole running out of virtual address space isn't a thing, it's not a valid concept. – sourcejedi Dec 31 '17 at 11:12
Thank you. I was trying to understand from this example how the OS decides on memory allocation when we use this Python method. I understand now, it's just the same as a typical executable. The same range of memory, hence no limitation or optimizations. I am not familiar with memory mapping in Python. Thank you for the explanation. – Rahul Dec 31 '17 at 13:32
4

@Rahul yep! python mmap.mmap() is a fairly thin wrapper around the C function mmap(). And mmap() is the same mechanism used to run executables, and code from shared libraries. – sourcejedi Dec 31 '17 at 13:50
1

But why would one want to avoid writing C? For this, it's no more difficult (assuming a knowledge of both languages) than Python, and perhaps a bit more compact. – jamesqf Jan 01 '18 at 02:22
2

@jamesqf I could be wrong, but I feel it is just a personal choice. Since the performance losses would be negligible (because as he said, the function actual does call the c function), the overhead wastage is very low, since no other stuff is happening in between. C would have been better, but this solution was not aiming for optimization, just to solve the bigger and difficult 70gb issue. – Rahul Jan 01 '18 at 08:07
1

In general, writing in python is more compact. In this case it turned out there's a couple of details in the python version, and the C version might have been nicer to write. (Though it's not so simple if search can contain a NUL character. And I notice the other C version here does not support NUL characters in replace.). You're very welcome to derive the C version for comparison purposes. However remember that my version includes basic error reporting for the operations it performs. The C version would at least be more annoying to read IMO, when error reporting is included. – sourcejedi Jan 01 '18 at 10:30
1

@Rahul: Sure, it's personal choice, but I don't think "to avoid writing in C" is a good basis for that choice. Especially if one happens to be fluent in C. Indeed, I'd probably use C to avoid writing in Python :-) – jamesqf Jan 01 '18 at 18:13
@jamesqf I respect your opinion, and also that of op. When I answered your comment, the context was my comment - regarding memory space and how a c function was wrapped. Hence my response. When I read this response of yours, I did read the answer again, and understood the context you quote. I had not responded to that. So, I withdraw. I think the conversation was between you two, and I meddled, sorry for that to both of you. And also, I dont know deep c or python, I am studying :) – Rahul Jan 01 '18 at 18:27
1

@jamesqf: I'm unaware of a standard libc function that searches arbitrary byte sequences in raw memory, i. e. disregarding any NUL termination characters. Of course we can roll our own, but that's going to be at least one of slow (with a naive algorithm) or error prone (for an efficient algorithm). Python’s find function on buffer-like objects is supposedly well tested and implements a sensible search algorithm like Boyer-Moore. – David Foerster Jan 05 '18 at 00:08

score 17 · Answer 4 · answered Dec 29 '17 at 21:11

There is a replace utility in the mariadb-server/mysql-server package. It replaces simple strings (not regular expressions) and unlike grep/sed/awk replace does not care about \n and \0. Memory consumption is constant with any input file (about 400kb on my machine).

Of course you do not need to run a mysql server in order to use replace, it is only packaged that way in Fedora. Other distros/operating systems may have it packaged separately.

score 16 · Answer 5 · edited Dec 31 '17 at 21:43

I think the C version might perform much better:

#include <stdio.h>
#include <string.h>

#define PAT_LEN 5

int main()
{
    /* note this is not a general solution. In particular the pattern
     * must not have a repeated sequence at the start, so <unk> is fine
     * but aardvark is not, because it starts with "a" repeated, and ababc
     * is not because it starts with "ab" repeated. */
    char pattern[] = "<unk>";          /* set PAT_LEN to length of this */
    char replacement[] = "<raw_unk>"; 
    int c;
    int i, j;

    for (i = 0; (c = getchar()) != EOF;) {
        if (c == pattern[i]) {
            i++;
            if (i == PAT_LEN) {
                printf("%s", replacement);
                i = 0;
            }
        } else {
            if (i > 0) {
                for (j = 0; j < i; j++) {
                    putchar(pattern[j]);
                }
                i = 0;
            }
            if (c == pattern[0]) {
                i = 1;
            } else {
                putchar(c);
            }
        }
    }
    /* flush a partial match left pending at end of input */
    for (j = 0; j < i; j++) {
        putchar(pattern[j]);
    }
    return 0;
}

EDIT: Modified according to suggestions from the comments. Also fixed bug with the pattern <<unk>.

icarus

17,920

answered Dec 29 '17 at 20:14

Patrick Bucher

775

I measured 0.3 seconds for 30 megabytes, so it could be done within 12 minutes. – Patrick Bucher Dec 29 '17 at 20:18
2

you may print (pattern[j]) instead of (buf[j]) (they are equal at this point, so you don't need a buffer) – RiaD Dec 30 '17 at 01:30
3

also code will not work for string "<" https://ideone.com/ncM2yy – RiaD Dec 30 '17 at 01:31
1

If you want to do it efficiently, doing binary read/writes and asynchronous I/O would probably help. – jamesqf Dec 30 '17 at 04:21
10

30 MB in 0.3 seconds? That's only 90 MB / second. memcpy speed (i.e. the memory bottleneck) is something like 12GB / second on a recent x86 CPU (e.g. Skylake). Even with stdio + system call overhead, for a 30MB file hot in disk cache, I'd expect maybe 1GB / second for an efficient implementation. Did you compile with optimization disabled, or is one-char-at-a-time I/O really that slow? getchar_unlocked / putchar_unlocked might help, but definitely better to read/write in chunks of maybe 128kiB (half of L2 cache size on most x86 CPUs, so you mostly hit in L2 while looping after read) – Peter Cordes Dec 30 '17 at 06:58
2

from top of my head, getchar and putchar is slow. – Rui F Ribeiro Dec 30 '17 at 17:20
3

The fix to the program for "<<unk>" still doesn't work if the pattern starts with a repeated sequence of characters (i.e. it wouldn't work if you were trying to replace aardvark with zebra and you had input of aaardvark, or you were trying to replace ababc and had input of abababc). In general you cannot move forward by the number of characters you have read unless you know that there is no possibility of a match starting in the characters you have read. – icarus Dec 30 '17 at 21:27
@icarus: Sure, the program only works for the pattern <unk>. It's not a general solution to a substitution problem, it's a solution for the very specific problem at hand. – Patrick Bucher Dec 31 '17 at 12:04
All these bugs are why writing it in python was the better choice. Premature optimization is the root of all evil and the hallmark of an inexperienced engineer! – Byron Whitlock Jan 06 '18 at 16:20
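The limitation icarus describes (patterns such as aardvark or ababc that begin with a repeated sequence) can be lifted with the Knuth-Morris-Pratt failure table, which records how much of a partial match is still usable after a mismatch. The following is a minimal Python sketch of that idea, not a port of the C program above; the names failure_table and kmp_replace are made up for illustration, and it assumes a non-empty pattern.

```python
def failure_table(pattern):
    """fail[k] is the length of the longest proper prefix of
    pattern[:k + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_replace(data, pattern, replacement):
    """Yield output pieces for data (an iterable of byte values),
    replacing non-overlapping matches of pattern from left to right."""
    fail = failure_table(pattern)
    matched = 0  # pattern bytes matched so far, held back from the output
    for byte in data:
        while matched and byte != pattern[matched]:
            shorter = fail[matched - 1]
            # These held-back bytes can no longer begin a match: emit them.
            yield pattern[:matched - shorter]
            matched = shorter
        if byte == pattern[matched]:
            matched += 1
            if matched == len(pattern):
                yield replacement
                matched = 0
        else:
            yield bytes([byte])
    # Input ended in the middle of a possible match: emit what was held back.
    yield pattern[:matched]
```

Joining the pieces, for example b"".join(kmp_replace(data, b"aardvark", b"zebra")), should give the same result as data.replace(...). Going byte by byte in Python is slow, so for 70GB this is meant to show the logic rather than to be run as is.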

meuh · Answer 6 · 2017-12-29T16:43:00.297

GNU grep can show you the offset of matches in "binary" files, without having to read whole lines into memory. You can then use dd to read up to this offset, skip over the match, then continue copying from the file.

file=...
newfile=...
replace='<raw_unk>'
grep -o -b -a -F '<unk>' <"$file" |
(   pos=0
    while IFS=$IFS: read offset pattern
    do size=${#pattern}
       let skip=offset-pos
       let big=skip/1048576
       let skip=skip-big*1048576
       dd bs=1048576 count=$big <&3
       dd bs=1 count=$skip <&3
       dd bs=1 count=$size of=/dev/null <&3
       printf "%s" "$replace"
       let pos=offset+size
    done
    cat <&3
) 3<"$file" >"$newfile"

For speed, I've split the dd into a big read of blocksize 1048576 and a smaller read of 1 byte at a time, but this operation will still be a little slow on such a large file. The grep output is, for example, 13977:<unk>, and this is split on the colon by the read into variables offset and pattern. We have to keep track in pos of how many bytes have already been copied from the file.

alfreema · Answer 7 · 2018-01-02T13:15:32.823

Here is another single UNIX command line that might perform better than other options, because you can "hunt" for a "block size" that performs well. For this to be robust you need to know that you have at least one space in every X characters, where X is your arbitrary "block size". In the example below I have chosen a "block size" of 1024 characters.

fold -w 1024 -s corpus.txt | sed 's/<unk>/<raw_unk>/g' | tr -d '\n'

Here, fold will grab up to 1024 bytes, but the -s makes sure it breaks on a space if there is at least one since the last break.

The sed command is yours and does what you expect.

Then the tr command will "unfold" the file converting the newlines that were inserted back to nothing.

You should consider trying larger block sizes to see if it performs faster. Instead of 1024, you might try 10240 and 102400 and 1048576 for the -w option of fold.

Here is an example broken down by each step that converts all the N's to lowercase:

[root@alpha ~]# cat mailtest.txt
test XJS C4JD QADN1 NSBN3 2IDNEN GTUBE STANDARD ANTI UBE-TEST EMAIL*C.34X test

[root@alpha ~]# fold -w 20 -s mailtest.txt
test XJS C4JD QADN1
NSBN3 2IDNEN GTUBE
STANDARD ANTI
UBE-TEST
EMAIL*C.34X test

[root@alpha ~]# fold -w 20 -s mailtest.txt | sed 's/N/n/g'
test XJS C4JD QADn1
nSBn3 2IDnEn GTUBE
STAnDARD AnTI
UBE-TEST
EMAIL*C.34X test

[root@alpha ~]# fold -w 20 -s mailtest.txt | sed 's/N/n/g' | tr -d '\n'
test XJS C4JD QADn1 nSBn3 2IDnEn GTUBE STAnDARD AnTI UBE-TEST EMAIL*C.34X test

You will need to add a newline to the very end of the file if it has one, because the tr command will remove it.

How do you make sure you are not breaking the pattern in edge cases where there isn't enough whitespace available? — rackandboneman, Jan 02 '18 at 15:00
As stated, for this to be robust there's a requirement that there is at least one space every X characters. You can do that analysis easy enough, with any blocksize you choose:
fold -w X mailtest.txt | grep -v " " | wc -l

The number it returns is the number of folded lines with potential edge cases. If it's zero, the solution is guaranteed to work. — alfreema, Jan 02 '18 at 20:53
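The same check can be done in one streaming pass by measuring the longest stretch of the file that contains no space at all. Below is a rough Python sketch; the name longest_run_without is hypothetical, and it assumes a single-byte separator and only considers spaces, not other blank characters that fold may also break on.

```python
def longest_run_without(stream, sep=b" ", chunk_size=1 << 20):
    """Return the length of the longest run of bytes containing no sep.

    stream is a binary file object; sep is assumed to be a single byte.
    """
    longest = 0
    current = 0  # length of the run still open at the end of the last block
    while True:
        block = stream.read(chunk_size)
        if not block:
            return max(longest, current)
        pieces = block.split(sep)
        # The first piece continues the run carried over from earlier blocks.
        current += len(pieces[0])
        if len(pieces) > 1:
            # A separator was seen: the carried run and all middle pieces
            # are finished runs, and the last piece opens a new run.
            longest = max(longest, current, *(len(p) for p in pieces[1:-1]))
            current = len(pieces[-1])
```

If the number it reports for the corpus is well below the width passed to fold -w, then fold -s always has a space to break on.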

Evan Carroll · Answer 8 · 2017-12-31T22:43:28.313

Using `perl`

Managing your own buffers

You can use IO::Handle's setvbuf to manage the default buffers, or you can manage your own buffers with sysread and syswrite. Check perldoc -f sysread and perldoc -f syswrite for more information, essentially they skip buffered io.

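Here we roll our own buffered IO, manually and with an arbitrary chunk size of 32 KiB. We also open the file for RW so we do it all on the same FH at once. Because the file is edited in place, the replacement must have exactly the same byte length as the pattern; the example below uses <rnk> as a stand-in.

use strict;
use warnings;
use Fcntl qw(:flock O_RDWR);
use autodie;
use bytes;

use constant CHUNK_SIZE => 1024 * 32;

sysopen my $fh, 'file', O_RDWR;
flock($fh, LOCK_EX);

my $pos = 0;    # file offset of the chunk currently held in $bytes
while ( my $len = sysread( $fh, my $bytes, CHUNK_SIZE ) ) {
  # <rnk> is a stand-in replacement with the same byte length as <unk>
  if ( $bytes =~ s/<unk>/<rnk>/g ) {
    sysseek( $fh, $pos, 0 );          # go back to the start of this chunk
    syswrite( $fh, $bytes, $len );    # leaves the offset at the next chunk
  }
  $pos += $len;
}

If you're going to go this route

Make sure the pattern and its replacement are the same byte size (<raw_unk> is longer than <unk>, so it cannot be written in place).
You may want to make sure a match cannot straddle a CHUNK_SIZE boundary, if you're replacing more than 1 byte; a match split across two reads is not seen.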

What if <unk> falls on a boundary between chunks? — liori, Jan 02 '18 at 20:59

score 10 · Answer 9 · answered Dec 31 '17 at 02:52

You could try bbe (binary block editor), a "sed for binary files".

I had good success using it on a 7GB text file with no EOL chars, replacing multiple occurrences of a string with one of different length. Without attempting any optimisation it gave an average processing throughput of > 50MB/s.

score 5 · Answer 10 · edited Jun 11 '20 at 14:16

Here's a small Go program that performs the task (unk.go):

package main
import (
    "bufio"
    "fmt"
    "log"
    "os"
)
func main() {
    const (
        pattern     = "<unk>"
        replacement = "<raw_unk>"
    )
    var match int
    var char rune
    scanner := bufio.NewScanner(os.Stdin)
    scanner.Split(bufio.ScanRunes)
    for scanner.Scan() {
        // Decode the whole rune; indexing Text()[0] would keep only its first byte.
        char = []rune(scanner.Text())[0]
        if char == []rune(pattern)[match] {
            match++
            if match == len(pattern) {
                fmt.Print(replacement)
                match = 0
            }
        } else {
            if match > 0 {
                fmt.Print(string(pattern[:match]))
                match = 0
            }
            if char == rune(pattern[0]) {
                match = 1
            } else {
                fmt.Print(string(char))
            }
        }
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
    // Flush a partial match left pending at end of input.
    fmt.Print(pattern[:match])
}

Just build it with go build unk.go and run it as ./unk <input >output.

EDIT:

Sorry, I didn't read that everything is in one line, so I tried to read the file character by character now.

EDIT II:

Applied same fix as to the C program.

Community

1

answered Dec 29 '17 at 15:58

Patrick Bucher

775

1

does this avoid reading the entire file into memory? – cat Dec 29 '17 at 19:08
1

It reads the file character by character and never holds the entire file in the memory, just individual characters. – Patrick Bucher Dec 29 '17 at 19:10
1

scanner.Split(bufio.ScanRunes) does the magic. – Patrick Bucher Dec 29 '17 at 19:27
Also check go doc bufio.MaxScanTokenSize for the default buffer size. – Patrick Bucher Dec 29 '17 at 19:39
Like your C program, this doesn't work for replacing aardvark with zebra with an input of aaardvark. – icarus Dec 30 '17 at 21:59
@icarus: As stated in the comment to the C program, the program only deals with the one specific substitution of <unk>. Other patterns require a different logic. – Patrick Bucher Dec 31 '17 at 12:05

score 5 · Answer 11 · answered Dec 29 '17 at 21:07

With perl, you could work with fixed length records like:

perl -pe 'BEGIN{$/=\1e8}
          s/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new

And hope that there won't be <unk>s spanning across two of those 100MB records.

Stéphane Chazelas

544,893

I also was thinking about this method, but using the while read -N 1000 chunk; (the 1000 picked as an example). The solution for the <unk>, broken between the chunks, is two passes through the file: the first with the 100MB chunks and the second with the '100MB + 5 byte' chunks. But it is not optimal solution in the case of the 70GB file. – MiniMax Dec 29 '17 at 22:07
3

You don't even need two passes. Read block A. While not EOF, read block B. Search/Replace in A+B. A := B. Loop. Complexity is ensuring you don't replace inside the replacement. – Chris Davies Dec 29 '17 at 23:23
@MiniMax, that second pass would not necessarily help as the first pass would have added 5 bytes for each occurrence of <unk>. – Stéphane Chazelas Dec 30 '17 at 22:32
1

@roaima, yes that would be a much more involved solution. Here it's a simple approach which is only highly probable (assuming the <unk> occurrences are far apart, if not, use $/ = ">" and s/<unk>\z/<raw_unk>/g) of being correct. – Stéphane Chazelas Dec 30 '17 at 22:35
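The single-pass idea from the comments (keep enough of the previous block around that a match spanning two reads is still seen, without ever replacing inside the replacement) might look roughly like this in Python. It is a sketch rather than a tuned implementation; replace_stream and its chunk_size default are names and values I picked for illustration.

```python
def replace_stream(src, dst, search, replace, chunk_size=1 << 20):
    """Copy src to dst, replacing every occurrence of search with replace.

    src and dst are binary file objects. Only about chunk_size plus
    len(search) bytes are held in memory at any time.
    """
    if not search:
        raise ValueError("search string must not be empty")
    keep = len(search) - 1  # longest match prefix that can straddle two reads
    buf = b""
    while True:
        block = src.read(chunk_size)
        if not block:
            break
        buf += block
        pos = 0
        while True:
            i = buf.find(search, pos)
            if i < 0:
                break
            dst.write(buf[pos:i])
            dst.write(replace)  # written straight out, never rescanned
            pos = i + len(search)
        # Hold back the last `keep` unmatched bytes: they may begin a match
        # that is completed by the next block.
        safe = max(pos, len(buf) - keep)
        dst.write(buf[pos:safe])
        buf = buf[safe:]
    dst.write(buf)  # fewer than len(search) bytes, so no match is possible
```

A call such as replace_stream(open("corpus.txt", "rb"), open("corpus.txt.new", "wb"), b"<unk>", b"<raw_unk>") would then process the file in 1 MiB reads regardless of where the matches fall.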

score 1 · Answer 12 · answered Jan 04 '18 at 17:25

This may be overkill for a 70GB file and simple search & replace, but the Hadoop MapReduce framework would solve your problem right now at no cost (choose the 'Single Node' option when setting it up to run it locally) - and can be scaled to infinite capacity in the future without the need to modify your code.

The official tutorial at https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html uses (extremely simple) Java but you can find client libraries for Perl or whatever language you feel like using.

So if later on you find that you are doing more complex operations on 7000GB text files - and having to do this 100 times per day - you can distribute the workload across multiple nodes that you provision or that are automatically provisioned for you by a cloud-based Hadoop cluster.

yes, yes it is. "Don't use Hadoop - your data isn't that big". This is a very simple streaming IO problem. — sourcejedi, Jan 04 '18 at 17:36

JJoao · Answer 13 · 2018-03-16T12:15:57.180

0

If we have a minimum amount of <unk> (as expected by Zipf's law),

awk -v RS="<unk>" -v ORS="<raw_unk>" 1

edited Mar 16 '18 at 12:15

answered Mar 16 '18 at 09:30

JJoao

1

No. sed reads a line at a time into memory regardless. It will not be able to fit this line. – Kusalananda Mar 16 '18 at 09:34
1

I can find no documentation that says anything other than that GNU sed will not do input/output buffering when using this flag. I can't see that it will read partial lines. – Kusalananda Mar 16 '18 at 09:53
this awk, not sed -> upvoting – botkop Dec 08 '20 at 14:57

score 0 · Answer 14 · answered May 16 '19 at 00:52

All of the previous suggestions require reading the entire file and writing the entire file. This not only takes a long time but also requires 70GB of free space.

1) If I understand your specific case correctly, would it be acceptable to replace <unk> with some other string of the SAME length?

2a) Are there multiple occurrences? 2b) If so do you know how many?

I'm sure you have solved this year-plus problem already and I'd like to know what solution you used.

I'd propose a solution (most likely in C) that would read the file in BLOCKS, searching each for the string and taking into account possible block crossing. Once found, replace the string with the SAME-length alternative and then write only that BLOCK, continuing for the known number of occurrences or until end of file. This would require as few as number-of-occurrences writes and at most twice that (if every occurrence was split between 2 blocks). This would require NO additional space!
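As a rough illustration of that proposal, here is a Python sketch that uses a writable memory mapping instead of explicit block reads, so the operating system takes care of block crossings and only writes back the pages that were touched. The function name replace_in_place is hypothetical, and the <rnk> token in the usage note is just a same-length stand-in.

```python
import mmap


def replace_in_place(path, search, replace):
    """Overwrite every occurrence of search in the file at path.

    Returns the number of replacements. search and replace must have the
    same length, so the file size never changes and no extra disk space
    is needed. Note that mmap refuses to map an empty file.
    """
    if not search or len(search) != len(replace):
        raise ValueError("search and replace must be non-empty and equal in length")
    count = 0
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mem:
        i = mem.find(search)
        while i >= 0:
            mem[i:i + len(search)] = replace  # dirties only the touched pages
            count += 1
            i = mem.find(search, i + len(search))
    return count
```

For instance, replace_in_place("corpus.txt", b"<unk>", b"<rnk>") modifies the file directly, so it is worth trying on a copy or a small sample first.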
