9

I have a file containing a list of words. I want to remove all occurrences of all the words in this file from a big text file.

Example:

File 1

queen
king

Text file sample

Both the king and queen are monarchs. Will the queen live? Queen, it is!

This is what I have tried:

sed -i 's/queen/ /g' page.txt
sed -i 's/Queen/ /g' page.txt

Output

Both the and are monarchs. Will the live? , it is!

The list of words I have is big (over 50000 words). How can I do this without having to specify the pattern in the command line?

  • So you need this to be i) case insensitive and ii) ignore punctuation (queen, matches queen)? How about substrings? Should king match hiking? Or high-king? – terdon Nov 10 '16 at 12:26
  • What have you tried so far? Where did you get stuck? See http://unix.stackexchange.com/q/112023/135943 if you're just starting out. (Show some effort and you're more likely to get help; there is a wealth of information on this site already, very easily searchable.) – Wildcard Nov 10 '16 at 12:26
  • @terdon Just full string matches. Punctuation is ignored. –  Nov 10 '16 at 12:32
  • @Wildcard I have been able to use sed to remove all occurrences of a single word specified in the command line. I am not sure how to do that if there are multiple words, to be read from a file. –  Nov 10 '16 at 12:33
  • Excellent. Please [edit] your question and add the command you've used and how it fails in this case. – terdon Nov 10 '16 at 12:34
  • Regarding "How can I do this without having to specify the pattern in the command line?": sed can read its expressions from a file via the -f option. You could even write sed "programs" by putting #!/usr/bin/sed -f at the top of such a file and giving it execute permissions. However, with 50000 expressions I agree that sed seems to be an inappropriate choice. – countermode Nov 10 '16 at 14:02

3 Answers

7

For your actual use case I recommend terdon's answer using Perl.

However, the simple version, which does not handle words that are substrings of other words (so it will also remove "king" from "hiking"), is to use one Sed command to generate the commands that a second Sed instance then runs on your actual file.

In this case, with wordfile containing "king" and "queen" and textfile containing your text:

sed -e "$(sed 's:.*:s/&//ig:' wordfile)" textfile

Note that the "ignore case" flag is a GNU extension, not standard.
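With a very long word list, the generated script may become too large to pass as a single command-line argument. A rough sketch of a variant, assuming GNU sed: write the generated commands to a file and run them with -f, adding \b (also a GNU extension) so that "hiking" is left alone. The name remove.sed and the temporary directory are made up for illustration; the sample data is taken from the question.

```shell
# Work in a scratch directory so the sample files don't clutter anything
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample data from the question
printf '%s\n' queen king > wordfile
printf '%s\n' 'Both the king and queen are monarchs. Will the queen live? Queen, it is!' > textfile

# Turn every word into its own command, e.g. s/\bqueen\b//ig
# (\b and the i flag are GNU sed extensions; remove.sed is a hypothetical name)
sed 's:.*:s/\\b&\\b//ig:' wordfile > remove.sed

# Run the generated script against the text
result=$(sed -f remove.sed textfile)
printf '%s\n' "$result"
```

This still treats each word as a regular expression, so words containing / or regex metacharacters would need escaping first.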

Wildcard
4

The simple but inefficient way is to process the file multiple times, once for each input word:

$ while IFS= read -r w; do sed -i "s/$w//ig" file2; done < file1
$ cat file2
Both the  and  are monarchs. Will the  live? , it is!

That can be very slow for large files though (and also matches substrings). You could do it in a single pass with Perl:

perl -lpe 'BEGIN{open(A,"file1"); chomp(@k = <A>)} 
                 for $w (@k){s/\b\Q$w\E\b//ig}' file2 

The \b anchors make sure we only match at word boundaries, and \Q...\E makes sure $w is taken literally. This stops the script from matching hiking, but it will still match high-king, because a hyphen also counts as a word boundary. To avoid that, you need to explicitly list the characters that delimit a word:

perl -Mopen=locale -Mutf8 -lpe '
  BEGIN{open(A,"file1"); chomp(@k = <A>)} 
  for $w (@k){s/(^|[ ,.—_;])\Q$w\E([ ,.—_;]|$)/$1$2/ig}' file2 

That non-ASCII character above needs to be entered in UTF-8 encoding, as we're telling perl the code is written in UTF-8 with -Mutf8. We're using -Mopen=locale for the content of the files and stdout to be decoded/encoded in the locale's character set.

terdon
1

Save this script to a file named d:

#!/bin/bash

LIST=${1:?"LIST word"}
FILE=${2:?"FILE name not set"}

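# Join the word list into one alternation (queen\|king\|test), dropping any trailing \|
L=$(sed -e ':a;N;$!ba;s_\n_\\|_g' -e 's_\(\\|\)*$__' "${LIST}")
# Delete every listed word plus any spaces after it (the i flag needs GNU sed)
P='s_\('"$L"'\) *__ig'

sed -e "$P" "${FILE}"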

Then run it:

bash ./d LIST FILE 

If you want to save the result to a file, you can run:

bash ./d LIST FILE  | tee NewFILE

OR

bash ./d LIST FILE > NewFile

The script reads the word list and converts it into a regex alternation. For example, the words queen, king and test become:

queen\|king\|test

It then builds a sed command from that alternation:

sed -e 's_\(queen\|king\|test\) *__ig' FILE

With this bash script, the word list and the file are each read only once.
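To see the two steps in isolation, here is a small sketch that builds the same kind of alternation with paste instead of the sed loop (a swapped-in technique, not what the script itself does) and then applies it. It assumes GNU sed for the i flag; the LIST contents and the sample sentence are made up.

```shell
# Work in a scratch directory with a made-up word list
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' queen king test > LIST

# Join the lines with | and then escape each | for a basic regex
alt=$(paste -sd '|' LIST | sed 's/|/\\|/g')
printf '%s\n' "$alt"

# One sed pass that removes any listed word plus the spaces after it
out=$(printf '%s\n' 'The Queen and king test' | sed -e 's_\('"$alt"'\) *__ig')
printf '%s\n' "$out"
```

Because _ is used as the sed delimiter, a word containing an underscore would break the command, and the words are still treated as regular expressions.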

Baba