Is there an elegant, high-performance one-liner way to remove multiple complete strings from an input?
I process large text files, e.g. 1 million lines in inputfile and 100k matching strings in hitfile. I have a Perl script that loads the hitfile into a hash and then checks all 'words' on each line of the inputfile, but for my workflow I'd prefer a simple command over my own script.
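The script is roughly equivalent to this (only a simplified sketch, not the real script; here a 'word' is just a \w+ run):

perl -lne '
  # load the hit strings into a hash once
  BEGIN{ open(A, "<", "hitfile") or die; chomp(@k = <A>); @h{map lc, @k} = () }
  # split the line into word/non-word tokens, drop tokens that are exact hits, re-join
  print join "", grep { !exists $h{lc $_} } /\w+|\W+/g;
' inputfile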
The functionality I seek is equivalent to this:
perl -pe 's/\b(string1|string2|string3)\b//g'
or this method of nested sed's:
sed -e "$(sed 's:.*:s/&//ig:' hitfile)" inputfile
or looping in the shell:
while read w; do sed -i "s/$w//ig" inputfile ; done < hitfile
But those are way too expensive. This slightly more efficient method (from How to delete all occurrences of a list of words from a text file?) works, but it's still very slow:
perl -Mopen=locale -Mutf8 -lpe '
BEGIN{open(A,"hitfile"); chomp(@k = <A>)}
for $w (@k){s/(^|[ ,.—_;-])\Q$w\E([ ,.—_;-]|$)/$1$2/ig}' inputfile
Are there any other tricks to do this more concisely? Some other Unix command or method I'm overlooking? I don't need regexes; for speed, I only need to compare pure/exact strings against a hash. That is, "pine" should not match "pineapple", but it should match the "pine" in "(pine)".
For example, one idea I had was to expand the words of each input line onto separate lines, marking the original line breaks:
Before:
Hello, world!
After:
¶
Hello
,
world
!
And then filter with grep -vf (or rather grep -vxFf, for whole-token fixed-string matches), and then re-build/join the lines, roughly as sketched below.
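Something like this is what I have in mind (only a rough sketch; it assumes ¶ never occurs in the data and that splitting on \w+ vs. \W+ is an acceptable definition of 'word'):

# 1) mark each original line with ¶ and emit one token per line
# 2) drop tokens that exactly match a line of hitfile (add -i for case-insensitive)
# 3) re-join tokens, starting a new output line at each ¶
perl -ne 'chomp; print "¶\n"; print "$_\n" for /\w+|\W+/g' inputfile \
  | grep -vxFf hitfile \
  | perl -ne 'chomp; if ($_ eq "¶") { print "\n" if $. > 1 } else { print } END { print "\n" if $. }'

With -F and -x, grep only does whole-line fixed-string lookups, so I'd hope a 100k-line hitfile stays fast, but I haven't measured it.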
Any other ideas that would be fast and easy?