
I am using the following sed command to remove lines whose email address contains hotmail. Is it possible to check for multiple entries at the same time, preferably loaded from list.txt (one entry per line)?

sed -i '/^[^,]*,[^,]*hotmail/Id' data.txt

If I can't load it from a .txt, is there a way to do something like hotmail|gmail|yahoo?

data.txt line example:

"foxva****omes****","scott@hotmail.com","8*** Rd","Ne***ah","Wi***in","54***","*******"
"foxva****omes****","scott@gmail.com","8*** Rd","Ne***ah","Wi***in","54***","*******"
"foxva****omes****","scott@foo.you","8*** Rd","Ne***ah","Wi***in","54***","*******"
Teddy77

2 Answers


With sed, if you can format list.txt as a sed script, you can do it automatically. The following should work with GNU sed. With BSD sed it should work if you use -i '' -e for the second sed invocation:

sed -ne's|[]\*&^.$/[]|\\&|g' \
     -e's|..*|/^@&",/d|p' <./list.txt |
sed -i -e'h;s/[^,]*[^@]*//' -f- -eg ./data.txt

If you do...

-e's|..*|/^@&",/Id|p' ...

...in the second line, GNU sed will delete matches for any line in list.txt case-insensitively, but that is a syntax error with almost any other sed.

It tries to optimize the matching. At the head of the script that runs for every line, it saves a copy of the line in hold space, then removes the first field and everything before the first @ in the second field. It then runs the match checks and, if the line survives all of them, gets the saved copy back from hold space. That way sed doesn't need to match /^[^,]*,[^,]*.../ for every pattern. If list.txt is very long, though, it will not be a fast process regardless. grep -F should be preferred in that case (and probably in this case).
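To make the generated script concrete, here is a small sketch of what the first sed writes for a two-line list. The list contents and the temporary file are assumptions for illustration, not taken from the real list.txt:

```shell
# Hypothetical list with two domains (an assumption for this example).
list=$(mktemp)
printf '%s\n' 'hotmail.com' 'gmail.com' >"$list"

# Escape regex metacharacters, then wrap each line in a /^@...",/d command.
script=$(sed -ne's|[]\*&^.$/[]|\\&|g' \
             -e's|..*|/^@&",/d|p' <"$list")
printf '%s\n' "$script"
rm -f "$list"
```

Each list entry becomes one delete command anchored at the @ that the second sed leaves at the start of the pattern space.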


Both sed and grep can be made to perform better - and in many cases significantly so - if the charset used is reduced in size. For example, if you are currently in a UTF-8 locale then doing:

(   export LC_ALL=C
    sed -ne's|[]\*&^.$/[]|\\&|g' \
         -e's|..*|/^@&",/Id|p'   |
    sed -i -e'h;s/[^,]*[^@]*//' -f-\
         -eg ./data.txt
)   <./list.txt

...can make a world of difference in that rather than having to consider some umpteen-thousand different characters as matches, the regex engine need only consider 128 possibilities. It should not affect the results in any way - each char is a byte in the C locale and all will get due consideration.

sed -i is not a reliable switch to use in the best of cases, and should be avoided if at all possible.
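One common way to avoid -i, sketched here under assumptions (the scratch directory, the sample lines and the data.tmp name are all hypothetical), is to write the filtered copy to a temporary file and rename it over the original only if sed succeeded. The I flag is still GNU-specific:

```shell
# Work in a scratch directory so nothing real is touched.
dir=$(mktemp -d)
printf '%s\n' '"a","x@Hotmail.com","z"' '"b","y@foo.you","z"' >"$dir/data.txt"

# Write the filtered copy next to the original, and replace the original
# only if sed exited successfully.
sed '/^[^,]*,[^,]*hotmail/Id' "$dir/data.txt" >"$dir/data.tmp" &&
    mv "$dir/data.tmp" "$dir/data.txt"

result=$(cat "$dir/data.txt")
printf '%s\n' "$result"
rm -rf "$dir"
```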


To do this w/ grep and sed -i:

(   export LC_ALL=C
    cut -d\" -f4 | cut -d@ -f2    |
    grep -Fixnf ./list.txt        |
    sed -e's|:*\([0-9]*\).*|:\1|p'\
        -e's||\1!{p;n;b\1|p'      \
        -e's||};n|'               |
    sed -n -i -f- -e:n -e'p;n;bn' \
        ./data.txt
)   <./data.txt

That is the quickest way I can imagine it might be done with sed's -i. It breaks down like this:

  1. cut | cut

    • The two cuts reduce each ./data.txt input line from...

     "foxva****omes****","scott@hotmail.com","8*** Rd","Ne***ah","Wi***in","54***","*******"

      ...to...

     hotmail.com
    
  2. grep

    • grep can then compare that input against each line in its pattern file (-f list.txt), using case-insensitive (-i), fixed-string (-F), whole-line (-x) matches, and reporting the line number (-n) at the head of each line of its output.
  3. sed -e

    • sed strips grep's output to nothing but the line numbers and writes out another sed script which looks like (given hypothetical grep matches on lines 10 and 20):

     :10
     10!{p;n;b10
     };n
     :20
     20!{p;n;b20
     };n
    
  4. sed -n -i -f-

    • The last sed reads stdin (-f-) as its script and executes it only once. It doesn't restart the script for every input line, as is common with sed scripts. Instead it works its way through the input during that single pass, and it needs only a single test per input line.

    • Given our previous example, for lines 1-9 sed will do:

      • If the current line is !not the 10th, {then print the current line, overwrite the current line with the next input line, and branch back to the :label named 10.
    • On line 10 the test fails, so sed skips the {block} and the following n overwrites the line with the next one without printing it. Lines 11-20 are handled the same way by the :20 section.
    • For the last series of lines sed will print, then overwrite the current line with the next, then branch to the :n label until it has consumed all input.
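The steps above can be tried on a tiny input. This is a sketch only: the five sample lines, the two-entry list and the scratch directory are assumptions, and the final sed runs without -i so the result goes to stdout instead of back into the file:

```shell
dir=$(mktemp -d)
# Hypothetical sample data: lines 2 and 4 use a listed domain.
printf '"u","a@%s","x"\n' keep.org hotmail.com stay.net gmail.com last.io \
    >"$dir/data.txt"
printf '%s\n' hotmail.com gmail.com >"$dir/list.txt"

# Build the line-number script the same way as the pipeline above.
script=$(cut -d\" -f4 <"$dir/data.txt" | cut -d@ -f2 |
         LC_ALL=C grep -Fixnf "$dir/list.txt" |
         sed -e's|:*\([0-9]*\).*|:\1|p' \
             -e's||\1!{p;n;b\1|p' \
             -e's||};n|')
printf '%s\n' "$script"

# Run the generated script once over the data; lines 2 and 4 are skipped.
kept=$(printf '%s\n' "$script" |
       sed -n -f- -e:n -e'p;n;bn' "$dir/data.txt")
printf '%s\n' "$kept"
rm -rf "$dir"
```

The first printf shows a :2 section followed by a :4 section, and the second shows only the keep.org, stay.net and last.io lines.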


That doesn't work if ./data.txt is very large because sed gets stuck trying to process a script input file far larger than it can reliably handle. The way around that is to take input in chunks. This can be done reliably - even in a pipeline - if you use the right kind of reader. dd is that right kind of reader.

I created a test file like this:

sh -c ' _1=\"foxva****omes****\",\"scott@
        _2='\''","8*** Rd","Ne***ah","Wi***in","54***","*******"'\''
        n=0
        for m do printf "$_1%s$_2\n$_1$((n+=1))not_free.com$_2\n" "$m"
        done
'       $(cat ~/Downloads/list.txt) >/tmp/data.txt

...where list.txt is the list from your other question. It produces alternating lines like...

"foxva****omes****","scott@11mail.com","8*** Rd","Ne***ah","Wi***in","54***","*******"
"foxva****omes****","scott@1not_free.com","8*** Rd","Ne***ah","Wi***in","54***","*******"
"foxva****omes****","scott@123.com","8*** Rd","Ne***ah","Wi***in","54***","*******"
"foxva****omes****","scott@2not_free.com","8*** Rd","Ne***ah","Wi***in","54***","*******"

I then brought it up to a little over 80 MB like...

while [ "$(($(wc -c <data.txt)/1024/1024))" -lt 80 ]
do    cat <<IN >./data.txt
$(    cat ./data.txt ./data.txt)
IN
done
ls -hl ./data.txt
wc -l <./data.txt

-rw-r--r-- 1 mikeserv mikeserv 81M Jul 19 22:22 ./data.txt
925952

...and then I did...

(   trap rm\ data.tmp 0;  export  LC_ALL=C
    <./data.txt dd bs=64k cbs=512 conv=block    |
    while       dd bs=64k cbs=512 conv=unblock  \
                count=24  of=./data.tmp
                [ -s ./data.tmp ]
    do          
    <./data.tmp cut -d\" -f4 |  cut -d@  -f2    |
                grep -Fixnf ./list.txt          |
                sed -e's|:*\([0-9]*\).*|:\1|p'  \
                    -e's||\1!{p;n;b\1|p'        \
                    -e's||};n|'                 |
                sed -nf- -e:n -e'p;n;bn' ./data.tmp
    done        2>/dev/null
)|  wc -l

1293+1 records in
7234+0 records out
474087424 bytes (474 MB) copied, 21.8488 s, 21.7 MB/s
462976

You can see right there that the whole process took 22 seconds, and that the output line count is at least correct - 462976 is half of 925952 and the input should have come out halved.

The technique works because dd's reads and writes can be counted upon to the byte - even over a pipe if you know what you're about. And you can even break input out by line with the same degree of precision if you can reliably convert by a maximum line-length block-size (which is 512 here, or {_POSIX_LINE_MAX}).
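A minimal sketch of that block/unblock round trip, using a hypothetical cbs=8 and two short lines rather than the 512-byte blocks above:

```shell
# conv=block pads each newline-terminated line with spaces to cbs bytes
# and drops the newline, so every record has a fixed size.
blocked=$(printf 'ab\ncdef\n' | dd cbs=8 conv=block 2>/dev/null)
printf '[%s]\n' "$blocked"

# conv=unblock reverses that: it strips the trailing spaces from each
# cbs-sized record and appends a newline.
restored=$(printf 'ab\ncdef\n' | dd cbs=8 conv=block 2>/dev/null |
           dd cbs=8 conv=unblock 2>/dev/null)
printf '%s\n' "$restored"
```

Because every blocked record is exactly cbs bytes, a byte count is also a line count, which is what lets count= cut the stream on line boundaries.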

The imaginative reader might rightly surmise that the same technique could be applied to an in-stream of any kind - even the live-log kind - with only a slight modification here or there (namely, to do it safely, the first dd's arguments would need to change from bs= to obs=). In every case, though, you would need some assurance of the maximum input line size, and, if a line can legitimately end in a <space> character, some additional filter mechanism inserted before the dd processes to protect against the trailing <spaces> being stripped with dd conv=unblock (which works by stripping off all trailing blanks for each cbs-sized conversion block and appending a \newline). tr and (un|)expand spring to my mind as likely candidates for such a filter.
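The trailing-space problem, and one possible tr filter around it, can be sketched as follows. The stand-in byte \001 is an assumption: it only works if that byte never occurs in the real data.

```shell
# A line ending in a space loses that space on the round trip...
lossy=$(printf 'end \n' | dd cbs=8 conv=block 2>/dev/null |
        dd cbs=8 conv=unblock 2>/dev/null)

# ...unless real spaces are swapped for a stand-in byte first and swapped
# back afterwards, so that only dd's own padding gets stripped.
safe=$(printf 'end \n' | tr ' ' '\001' |
       dd cbs=8 conv=block 2>/dev/null |
       dd cbs=8 conv=unblock 2>/dev/null | tr '\001' ' ')
printf '[%s] [%s]\n' "$lossy" "$safe"
```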

This is not the fastest way to do this; for that I expect you'd want to look to a sort -m (merge) operation. But it is pretty quick, and it will work with your data. It does kind of break the sed -i thing, though, but I think that will be true no matter which way you go.

mikeserv

You can solve this in a couple of different ways. For one, sed supports multiple expressions in a single run:

sed -i -e '/^[^,]*,[^,]*hotmail/Id' -e '/^[^,]*,[^,]*gmail/Id' -e '/^[^,]*,[^,]*yahoo/Id' data.txt

You can also do this in a single expression:

sed -i -e '/^[^,]*,[^,]*\(hotmail\|gmail\|yahoo\)/Id' data.txt

Note that (, ), and | all need to be escaped in this form, because sed uses basic regular expressions by default (and \| is a GNU extension there).
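For comparison, a sketch of the same deletion with extended regular expressions (-E), where the grouping and alternation characters are written unescaped. The input lines are shortened, hypothetical versions of the question's sample data, and the I flag remains GNU-specific:

```shell
out=$(printf '%s\n' \
        '"foxva","scott@hotmail.com","8 Rd"' \
        '"foxva","scott@GMAIL.com","8 Rd"' \
        '"foxva","scott@foo.you","8 Rd"' |
      sed -E '/^[^,]*,[^,]*(hotmail|gmail|yahoo)/Id')
printf '%s\n' "$out"
```

Only the foo.you line is left; the uppercase GMAIL line is removed because of the I flag.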

Will
  • The former or the latter? – Will Jul 20 '15 at 02:10
  • Ah, thanks! I edited to correct. I hadn't realized that | needed to be escaped. I just tested this with GNU sed version 4.2.2 and it worked properly; but, you're right, without the escapes, it doesn't. Is there still something I'm missing? – Will Jul 20 '15 at 02:13
  • Just a clarification on which seds will actually handle that syntax. And the -e isn't strictly necessary with any sed between /address/d statements; a semicolon will do. – mikeserv Jul 20 '15 at 02:15
  • Makes sense. I thought that if you were using multiple editing commands, -e was required. But different seds are different; this is for GNU sed. BSD sed doesn't support the /I modifier so I deduced he was using GNU. – Will Jul 20 '15 at 02:23
  • If you use -E you do not need to escape ()|. – mikeserv Jul 20 '15 at 02:40