Is there an elegant, high-performance one-liner way to remove multiple complete strings from an input?
I process large text files, e.g. 1 million lines in inputfile and 100k matching strings in hitfile. I have a Perl script that loads the hitfile into a hash and then checks all 'words' on each line of the inputfile, but for my workflow I'd prefer a simple command over my own script.
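The script is roughly equivalent to this (only a simplified sketch, not the real script; here a 'word' is just a \w+ run):

perl -lne '
  # load the hit strings into a hash once
  BEGIN{ open(A, "<", "hitfile") or die; chomp(@k = <A>); @h{map lc, @k} = () }
  # split the line into word/non-word tokens, drop tokens that are exact hits, re-join
  print join "", grep { !exists $h{lc $_} } /\w+|\W+/g;
' inputfile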
The functionality I seek is equivalent to this:
perl -pe 's/\b(string1|string2|string3)\b//g'
or this method of nested sed's:
sed -e "$(sed 's:.*:s/&//ig:' hitfile)" inputfile
or looping in the shell:
while read w; do sed -i "s/$w//ig" inputfile ; done < hitfile
But those are way too expensive. This slightly more efficient method (from How to delete all occurrences of a list of words from a text file?) works, but it's still very slow:
perl -Mopen=locale -Mutf8 -lpe '
BEGIN{open(A,"hitfile"); chomp(@k = <A>)}
for $w (@k){s/(^|[ ,.—_;-])\Q$w\E([ ,.—_;-]|$)/$1$2/ig}' inputfile
Are there any other tricks to do this more concisely? Some other Unix command or method I'm overlooking? I don't need regexes; for speed, I only need to compare pure/exact strings against a hash. That is, "pine" should not match "pineapple", but it should match the "pine" in "(pine)".
For example, one idea I had was to expand the words of each input line onto separate lines, marking the original line breaks:
Before:
Hello, world!
After:
¶
Hello
,
world
!
And then filter with grep -vf (or rather grep -vxFf, for whole-token fixed-string matches), and then re-build/join the lines, roughly as sketched below.
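Something like this is what I have in mind (only a rough sketch; it assumes ¶ never occurs in the data and that splitting on \w+ vs. \W+ is an acceptable definition of 'word'):

# 1) mark each original line with ¶ and emit one token per line
# 2) drop tokens that exactly match a line of hitfile (add -i for case-insensitive)
# 3) re-join tokens, starting a new output line at each ¶
perl -ne 'chomp; print "¶\n"; print "$_\n" for /\w+|\W+/g' inputfile \
  | grep -vxFf hitfile \
  | perl -ne 'chomp; if ($_ eq "¶") { print "\n" if $. > 1 } else { print } END { print "\n" if $. }'

With -F and -x, grep only does whole-line fixed-string lookups, so I'd hope a 100k-line hitfile stays fast, but I haven't measured it.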
Any other ideas that would be fast and easy?