We have a very manual and rudimentary email unsubscribe system that I am trying to automate. There is a file containing the list of email addresses to unsubscribe, one address per line, so I guess it can be read with cat.

In the same folder, there are thousands of ".eml" files (raw email files) that are sent to sendmail in batch. Generating these .eml files is expensive, so we keep them in a folder and send them periodically until a person unsubscribes. What I am trying to do is write a bash script that loops through all the email addresses in the file, runs grep on the folder for each address, and deletes every file that grep matches.

Since my Unix skills are very limited, I am trying to do so as a reusable bash script (with loops and such) so that I can improve my Unix skills.

Samuel
  • My reading of the question was that if fooXXX.eml contains myaddress@domain.com, it should be deleted. – NickD Jan 31 '20 at 19:21
  • Thank you for your interest. No, the names of the files are nondescriptive and not helpful at all; the email addresses are in the content of the .eml files. – Samuel Jan 31 '20 at 19:21
  • @NickD indeed, that is exactly it. – Samuel Jan 31 '20 at 19:21
  • I suppose those .eml files are in some sort of rfc822 format with header and body, and the email addresses are found in the To: or Cc: headers. Is the syntax consistent? For instance, is the address always in the To: header, with <, > brackets around it, and on the same line as the To: prefix? Are email addresses potentially found elsewhere in the email (like email attachments containing forwarded emails, or in quoted parts, or in From/Reply-To headers...)? – Stéphane Chazelas Jan 31 '20 at 19:25
  • @StéphaneChazelas Thank you for your interest, the syntax of the headers is consistent. There are no brackets, only the email address directly in the To field. However, email addresses can also be found in the URL of a tracking image, as a parameter: "http://path...../image.png?fooparam=random@email.com". – Samuel Jan 31 '20 at 19:30
  • Your question is similar to mine https://unix.stackexchange.com/questions/564239/how-to-remove-files-which-either-dont-contain-a-string-or-cant-be-opened – Tim Jan 31 '20 at 23:50

2 Answers

1

A naive approach would be (assuming GNU utilities):

grep -FZlw -f address.list -- *.eml | xargs -r0 rm -f --

Or the same but with the long options as supported by GNU utilities:

grep --fixed-strings \
     --null --files-with-matches \
     --word-regexp \
     --file address.list \
     -- *.eml |
 xargs --no-run-if-empty --null \
   rm --force --

But that would delete files when addresses are found anywhere in the file, whether it's in the From:, To:, Cc:, Reply-To headers, or in the body of the email or in attachments.

Also, if address.list contains doe@example.com, that would also delete emails for john.doe@example.com and doe@example.com.eu, because -w only requires the match to be delimited by non-word characters, and "." is one.
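The over-matching described above is easy to reproduce in a scratch directory. This is only a minimal sketch, assuming GNU grep; the file names and addresses are made up for illustration and nothing here touches a real mail folder:

```shell
#!/bin/bash
# Scratch directory with hypothetical sample data.
demo=$(mktemp -d)
cd "$demo" || exit 1

printf 'To: doe@example.com\n\nhello\n'      > exact.eml
printf 'To: john.doe@example.com\n\nhello\n' > prefixed.eml
printf 'To: doe@example.com.eu\n\nhello\n'   > suffixed.eml
printf 'To: jane@example.org\n\nhello\n'     > unrelated.eml
printf 'doe@example.com\n' > address.list

# -w only requires non-word characters (or line boundaries) around the
# match, and "." is a non-word character, so the first three files are
# all listed, not just exact.eml.
matched=$(grep -Flw -f address.list -- *.eml)
printf '%s\n' "$matched"
```

Only unrelated.eml is left out of the list, which is why the word-based variant is risky when one address is a substring of another.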

That also assumes email addresses are formatted the same (same case, no MIME encoding) in the address.list and in the eml files.
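If the only difference is letter case, adding -i to the grep call may be enough; it does nothing for MIME-encoded addresses. A small sketch with made-up data, again assuming GNU grep:

```shell
#!/bin/bash
# Scratch directory with hypothetical sample data.
demo=$(mktemp -d)
cd "$demo" || exit 1

printf 'To: John.Doe@Example.COM\n\nhello\n' > mixed-case.eml
printf 'john.doe@example.com\n' > address.list

# Without -i the differently-cased address is not found (grep exits 1,
# hence the "|| true" so the assignment never aborts a "set -e" script).
case_sensitive=$(grep -Flw -f address.list -- *.eml || true)

# With -i the same file is listed.
case_insensitive=$(grep -Fliw -f address.list -- *.eml || true)

printf 'without -i: [%s]\nwith -i:    [%s]\n' "$case_sensitive" "$case_insensitive"
```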

If you know exactly how the emails are formatted, for instance if they're always going to contain one and only one occurrence of a line like:

To: address@example.com

Where address@example.com is formatted exactly like in your address.list, then you can do:

sed 's/^/To: /' address.list | grep -xZFlf - -- *.eml | xargs -r0 rm -f --

Which would be more reliable.

Instead of passing address.list as a list of words to be found anywhere in the files, we first transform it with sed (the stream editor) so that each line is prefixed with "To: ". The fixed-string patterns then become To: address@example.com, and -x/--line-regexp (instead of -w/--word-regexp) makes each pattern match only the full contents of a line, exactly. So the pattern To: address@example.com doesn't match a line such as Reply-To: address@example.com.eu, for instance.

Replace rm -f with grep -H '^To:' above if instead of removing the files, you want to check what the To: header is for the files that are to be removed.
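The preview-then-delete workflow can be tried safely in a scratch directory first. The sketch below assumes GNU grep, sed and xargs; the file names and addresses are invented for the example:

```shell
#!/bin/bash
# Scratch directory with hypothetical sample data.
demo=$(mktemp -d)
cd "$demo" || exit 1

printf 'To: doe@example.com\n\nhello\n'      > exact.eml
printf 'To: john.doe@example.com\n\nhello\n' > prefixed.eml
printf 'Reply-To: doe@example.com\nTo: jane@example.org\n\nhello\n' > reply-to.eml
printf 'doe@example.com\n' > address.list

# Step 1: preview. Same pipeline, but grep -H shows the To: header of
# each file that would be removed instead of removing it.
preview=$(sed 's/^/To: /' address.list | grep -xZFlf - -- *.eml |
          xargs -r0 grep -H '^To:' --)
printf '%s\n' "$preview"

# Step 2: once the preview looks right, run the real deletion.
sed 's/^/To: /' address.list | grep -xZFlf - -- *.eml | xargs -r0 rm -f --

# Only exact.eml should be gone.
ls
```

Because -x compares whole lines, neither the john.doe address nor the Reply-To: header is enough to select a file here.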

  • Thank you very much Stéphane, this is very helpful and instructive. Unfortunately I cannot upvote because I do not have enough reputation points. I have accepted the other answer because I was trying to create a reusable script using programming patterns that I would feel able to come up with on my own next time. Your solution is very instructive and I will keep studying it. Right now, using it would feel like using a piece of magic, which I am trying to avoid in order to learn, but I hope to feel comfortable leveraging the power of these utilities and their advanced syntax soon. – Samuel Jan 31 '20 at 19:53
  • Thanks. Your answer may also have given some information for https://unix.stackexchange.com/questions/564239/how-to-remove-files-which-either-dont-contain-a-string-or-cant-be-opened – Tim Jan 31 '20 at 23:51
  • I ended up using this solution for performance reasons, we have tens of thousands of files and it looks like this solution only parses every file once which is ideal from a performance perspective. However I am wondering: what are the multiple "--" for ? (for grep and xargs) – Samuel Feb 01 '20 at 06:01
  • I found my answer here: https://unix.stackexchange.com/questions/11376/what-does-double-dash-mean – Samuel Feb 01 '20 at 06:05
  • @Samuel, strictly speaking, the -- is for grep and rm, not xargs here. See also my edit for some explanation. – Stéphane Chazelas Feb 01 '20 at 09:00
0

Using the following script:

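#!/bin/bash

email_dir=./emails
unsubscribe_file=./emails/unsubscribe.txt

# No "IFS=" here: read must split the line so that the first word goes to
# "email" and anything left over goes to "_".
while read -r email _; do
    [ -n "$email" ] || continue    # skip blank lines
    # -l prints each matching file name once and -F treats the address as a
    # fixed string rather than a regex; mapfile stores one name per element.
    mapfile -t files < <(grep -rliF -- "$email" "$email_dir" | grep -v 'unsubscribe.txt')
    if ((${#files[@]}>1)); then
        printf '%s\n' "warning: Found multiple files for: $email" "${files[@]}" >&2
    elif ((${#files[@]}==1)); then
        rm -- "${files[0]}"
    fi
done < "$unsubscribe_file"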

email_dir should be set to the path of the directory containing the emails. unsubscribe_file should be set to the path of the file containing the email addresses to be unsubscribed.

The while loop will read through the unsubscribe file and for each line set the email variable to the first field (which should be the only field but _ will catch anything leftover should it exist)

We then grep all files in the email_dir directory for that email address. This also matches the unsubscribe file itself, so a second grep filters it out of the results. Ideally the unsubscribe file would not live in the same directory; if it does, be sure to change grep -v 'unsubscribe.txt' to reflect the actual name of your unsubscribe file.

We store the results in an array in case more than one file matches. In that case the script prints a warning and removes nothing. If exactly one file matches, that file is removed.
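One way to try the logic described above without touching the real mail folder is to run it against a scratch directory. The sketch below is a variant under stated assumptions, not the definitive script: it uses grep -l so that each array element is exactly one file name, requires bash 4 or later for mapfile, and all paths and addresses are made up:

```shell
#!/bin/bash
# Scratch directory with hypothetical sample data.
email_dir=$(mktemp -d)
unsubscribe_file=$email_dir/unsubscribe.txt

printf 'To: once@example.com\n\nhello\n'  > "$email_dir/a.eml"
printf 'To: twice@example.com\n\nhello\n' > "$email_dir/b.eml"
printf 'To: twice@example.com\n\nhello\n' > "$email_dir/c.eml"
printf 'To: keep@example.com\n\nhello\n'  > "$email_dir/d.eml"
printf '%s\n' once@example.com twice@example.com > "$unsubscribe_file"

while read -r email _; do
    [ -n "$email" ] || continue    # skip blank lines
    # One array element per matching file name.
    mapfile -t files < <(grep -rlF -- "$email" "$email_dir" | grep -v 'unsubscribe.txt')
    if ((${#files[@]} > 1)); then
        printf 'warning: %d files match %s, nothing removed\n' \
            "${#files[@]}" "$email" >&2
    elif ((${#files[@]} == 1)); then
        rm -- "${files[0]}"
    fi
done < "$unsubscribe_file"

# a.eml is removed; b.eml and c.eml trigger the warning and are kept.
ls "$email_dir"
```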

jesse_b
  • Thank you so much Jesse, this is exactly what I was trying to achieve, in light of your explanations I feel able to reproduce a similar solution on my own tomorrow, which I wouldn't feel able to do with a GNU utils solution using complex parameters. Thank you again this is so helpful – Samuel Jan 31 '20 at 19:48
  • I am studying the code to learn, what is the underscore for in IFS= read -r email _; ? Also can you please break down the syntax #varName[@] if you have a couple of seconds, that would be very helpful to me – Samuel Jan 31 '20 at 20:01
  • With read, "The line is split into fields as with word splitting, and the first word is assigned to the first NAME, the second word to the second NAME, and so on, with any leftover words assigned to the last NAME." So in your case this shouldn't be an issue, but say, for example, one of the lines in your unsubscribe file was jesse@yahoo.com foobar baz. Without the underscore it would produce email='jesse@yahoo.com foobar baz'; with the underscore it would create email=jesse@yahoo.com; _='foobar baz' – jesse_b Jan 31 '20 at 20:04
  • "${#files[@]}" will print the length of the files array (number of elements). This is documented in the bash manual under Shell Parameter Expansion – jesse_b Jan 31 '20 at 20:05
  • Thank you very much, I really appreciate your time – Samuel Jan 31 '20 at 20:05
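The read splitting and the array-length expansion discussed in these comments can be checked interactively. A small sketch in bash; the variable names whole, rest, unsplit and rest2 are made up for the demonstration, and rest is used instead of _ only because bash itself overwrites $_ after every command, which makes it awkward to print:

```shell
#!/bin/bash
line='jesse@yahoo.com foobar baz'

# One name: the whole line lands in it.
read -r whole <<< "$line"

# Two names: the first word goes to the first name, everything left
# over goes to the last name.
read -r email rest <<< "$line"

# With IFS set to the empty string for the read command, no splitting
# happens at all, so the first name receives the entire line.
IFS= read -r unsplit rest2 <<< "$line"

printf 'whole=[%s]\nemail=[%s] rest=[%s]\nunsplit=[%s] rest2=[%s]\n' \
    "$whole" "$email" "$rest" "$unsplit" "$rest2"

# ${#files[@]} expands to the number of elements in the array.
files=(a.eml b.eml c.eml)
count=${#files[@]}
printf 'count=%s first=%s\n' "$count" "${files[0]}"
```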