1

I have found a long list of free email providers that I want to remove from my email lists - https://gist.github.com/tbrianjones/5992856

Below are two commands I currently use that do the same job for a single domain or a handful of domains. How can I convert them to read the words from another file (remove.txt, for example) rather than adding each one manually?

ruby -rcsv -i -ne 'row = CSV::parse_line($_); puts $_ unless row[2] =~ /gmail|hotmail|qq.com|yahoo|live.com|comcast.com|icloud.com|aol.co/i' All.txt

sed -i '/^[^,]*,[^,]*hotmail/d' All.txt

Below is a sample line of the data this will be run on:

"fox*******","scott@sc***h.com","821 Ke****on Rd","Neenah","Wisconsin","54***6","UNITED STATES"
Teddy77
  • 2,983

2 Answers

1
{   sed -ne's/./^[^,]*,[^,]*&/p' | 
    grep -vf- ./All.txt 
}   <./remove.txt >./outfile

This is what I think you are asking about. I'm not sure how it is relevant to Ruby or to the line of data you're talking about...

If you want the matches to be case-insensitive, just add the -i (ignore case) option to grep, like:

{   sed -ne's/./^[^,]*,[^,]*&/p' | 
    grep -ivf- ./All.txt 
}   <./remove.txt >./outfile
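
To see what each stage does, here is a minimal sketch that runs the same pipeline on throwaway sample data. The file contents and the temporary directory are made up for illustration; only the sed and grep commands come from the answer.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical sample data: two domains (plus a blank line) and three records.
printf '%s\n' 'gmail.com' '' 'yahoo.com' > remove.txt
printf '%s\n' \
  '"a","bob@YAhoO.com","1 Main St"' \
  '"b","amy@example.org","2 Main St"' \
  '"c","joe@gmail.com","3 Main St"' > All.txt

# Stage 1: prefix every non-empty line with an anchor that pins the
# match to the second comma-separated field. Blank lines are skipped.
patterns=$(sed -ne 's/./^[^,]*,[^,]*&/p' < remove.txt)
printf '%s\n' "$patterns"
# ^[^,]*,[^,]*gmail.com
# ^[^,]*,[^,]*yahoo.com

# Stage 2: print only the lines that match none of the patterns (-v),
# ignoring case (-i), reading the patterns from stdin (-f -).
printf '%s\n' "$patterns" | grep -ivf - All.txt > outfile
cat outfile
# "b","amy@example.org","2 Main St"
```

Note that the `.` in each generated pattern is still a regex wildcard, so gmail.com would also match a string such as gmailxcom in the second field.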
mikeserv
  • 58,310
  • Hi, this doesn't seem to work if they have capitals in the email (YAhoO.com for example) – Teddy77 Jul 18 '15 at 02:33
  • @Teddy291 - Use grep -ivf- – mikeserv Jul 18 '15 at 02:35
  • Thanks. Do you know if there are any quicker ways of doing this? The file is over 80MB in size and I expect it to be down to 20-30MB after processing, it's been running for 12 hours and has only outputted 1MB. There are 3618 keywords to delete. – Teddy77 Jul 18 '15 at 18:12
  • @Teddy291 - Wow. That's pretty awful. Yes. Quicker would be to run multiple greps, each against a subset of the match list. Quicker still would be to use -F (fixed-string) matches, but in that case you wouldn't be able to anchor and could only do: grep . remove.txt | grep -ivFf- All.txt, which would drop every line containing any entry from remove.txt, no matter where in the line it occurred. More safely, you might cut your file down to only the matchable bits: sed -ne's/[^@]*@\([^,]*\).*/\1/p' <All.txt | grep -iFxnf remove.txt | sed 's/:.*/d/' | sed -f- All.txt >out – mikeserv Jul 18 '15 at 18:23
  • If you did the above in 5-10 passes, say on 5-10 smaller chunks of remove.txt, you could probably finish the deal in 5 minutes or so. The more you can reduce the number of matches grep has to check against per line, the better. – mikeserv Jul 18 '15 at 18:31
  • sed -ne's/[^@]*@\([^,]*\).*/\1/p' <1.txt | grep -iFxnf freeemails.txt | sed 's/:.*/d/' | sed -f- 1.txt > NoFreeEmails-1.txt - That was doing something for 15+ minutes with a 4MB file but it didn't remove anything – Teddy77 Jul 18 '15 at 19:43
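
The slowdown described in these comments most likely comes from testing every line against thousands of regular expressions. Below is a minimal sketch of a different technique: an awk hash lookup, not the sed/grep pipeline above. It assumes that exact domain matches are enough, that every field is double-quoted with no embedded quotes, and that the email is the second field, as in the question's sample line. The sample data and temporary directory are made up.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical sample data.
printf '%s\n' 'gmail.com' '' 'yahoo.com' > remove.txt
printf '%s\n' \
  '"a","bob@YAhoO.com","1 Main St"' \
  '"b","amy@example.org","2 Main St"' \
  '"c","joe@gmail.com","3 Main St"' > All.txt

awk -F'"' '
  # First file: store each non-empty line, lowercased, as a hash key.
  # A trailing carriage return (Windows line ending) is stripped first.
  NR == FNR { sub(/\r$/, ""); if ($0 != "") drop[tolower($0)] = 1; next }

  # Second file: with " as the separator, $4 is the second quoted field.
  {
    at = index($4, "@")
    domain = tolower(substr($4, at + 1))
    # Keep the line when there is no @ or the domain is not listed.
    if (at == 0 || !(domain in drop)) print
  }
' remove.txt All.txt > outfile

cat outfile
# "b","amy@example.org","2 Main St"
```

Each line costs one hash lookup regardless of how many entries remove.txt holds. The trade-off is that only whole domains match, so a keyword such as yahoo would not remove yahoo.co.uk.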
1

Two steps:

  1. create a remover script (AUX) containing print unless m!gmail.com|hotmail.com|...! (the regular expression is huge, but that is not a problem)
  2. apply it to All.txt

Code:

perl -n0E 's/\n/|/g; say "print unless m!\\b($_ç)\\b!\n" ' remove.txt > AUX
perl -n AUX    All.txt > outfile

Update 1: to be case-insensitive, add an i to the match operator:

perl -n0E 's/\n/|/g; say "print unless m!@($_=)\\b!i\n" ' remove.txt > AUX

Update 2: to remove extra domains, create a new file with the additional list (extra.txt) and:

cat remove.txt extra.txt | 
  perl -n0E 's/\n/|/g; say "print unless m!@($_=)\\b!i\n" ' > AUX
perl -n AUX   All.txt > outfile
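
Here is a sketch of what the two steps produce on made-up sample data. The file contents and temporary directory are assumptions for illustration; the perl commands are the ones from Update 1. Because remove.txt ends with a newline, the joined list ends with |. The literal = after $_ appears to be there so that this last alternative is a harmless = rather than an empty alternative, which would match every address.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical sample data.
printf '%s\n' 'gmail.com' 'yahoo.com' > remove.txt
printf '%s\n' \
  '"a","bob@YAhoO.com","1 Main St"' \
  '"b","amy@example.org","2 Main St"' \
  '"c","joe@gmail.com","3 Main St"' > All.txt

# Step 1: read remove.txt as a single record, join its lines with |,
# and write a one-statement Perl script to AUX.
perl -n0E 's/\n/|/g; say "print unless m!@($_=)\\b!i\n" ' remove.txt > AUX
head -n 1 AUX
# print unless m!@(gmail.com|yahoo.com|=)\b!i

# Step 2: run that script once per line of All.txt.
perl -n AUX All.txt > outfile
cat outfile
# "b","amy@example.org","2 Main St"
```

The pattern is anchored to the @ on the left and a word boundary on the right, so an entry only matches when it starts the domain. A bare keyword that appears in the middle of a domain name will not match.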
JJoao
  • 12,170
  • Thank you this works great. It did it instantly, it is not processing capitals though YAhoO.cOm for example - How can I do that? – Teddy77 Jul 18 '15 at 21:28
  • @Teddy291, Please check the update. – JJoao Jul 18 '15 at 23:42
  • Thank you, sorry to bother you again but I have one more request. If I wanted to remove privacyprotect.org but only "privacy" is in the remove.txt file how would I include that? – Teddy77 Jul 19 '15 at 00:31
  • @Teddy291, in that case you could (1) add this new domain to remove.txt or (2) create an extra.txt and do cat remove.txt extra.txt | perl -n0E '.......\n' > AUX – JJoao Jul 19 '15 at 08:16
  • The line "privacy" is in the remove.txt or extra.txt yet emails containing domainnameprivacyemail.com for example 92977d4025d75d003b7ce0126d2eecc8@domainnameprivacyemail.com are not being removed. – Teddy77 Jul 19 '15 at 11:31