I am in the process of trying to upload around 20 years' worth of Usenet archives to archive.org, but my first batch got rejected because some of the archives contained trojans that are encoded in base64. Since I have around 400 GB of files to work with, fixing things manually is out of the question. All of the files are in mbox format, which is plain text. My first thought was to find and replace all messages in the mbox file containing "Content-Type: application/x-msdownload", but that could be pretty difficult. I'm now thinking that an easier brute-force method would be to delete all base64 blocks.

From this question, I see that it's possible to find base64 blocks with grep, but I don't know how to set up the same thing with sed and that's why I'm asking. Thanks!

Edit: what I've tried so far

According to this page, ^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$ should be the regex needed to find base64 text, but when I try to use that with sed, it doesn't actually work or at least it doesn't do what I expect.

Example:

cat clari.local.california.sfbay.biz.mbox | sed -e '#^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$#d' > clari.local.california.sfbay.biz.mbox.test

clari.local.california.sfbay.biz.mbox.test still contains the base64 text.

Kusalananda
  • 333,661
  • It sounds like you may want to delete the offending attachments. For example, a scriptable MUA like mutt should be able to do that. Since you're working on mbox files, just deleting any base64-encoded block would likely also delete many encoded messages, images, and other data. – Kusalananda Feb 11 '22 at 07:57
  • I get that, but the odds that the base64 block is legitimate are quite small. I am archiving discussion newsgroups and not binary newsgroups (that would be several TB of data) where images, etc. would probably be spam and not legit. – elmerjfudd Feb 11 '22 at 08:18
  • Does your regular expression expect that the base64 content is on a single line? In my experience, the base64 code can be folded onto several lines. Also, mail messages may be base64-encoded if they contain certain character encodings, not just when the content is binary. So I still think it's much safer to go by attachment MIME-type. – Kusalananda Feb 11 '22 at 09:24
  • "Does your regular expression expect that the base64 content is on a single line?" I don't know. I'm fairly unfamiliar with regex. Here's an example of a message containing malware. If I remove just the base64-encoded material, then it passes the malware checker. If it is left in, it fails. https://pastebin.com/bFyEevqC – elmerjfudd Feb 11 '22 at 09:29
  • Well, in that example, the encoded file Install.exe has been removed and there are two encoded GIF images. This means there is no actual offending content to remove. – Kusalananda Feb 11 '22 at 09:32
  • I agree, but clamav still sees the other base64-encoded images as malware, and so does archive.org. When I manually remove them, clamav sees the message as clean. – elmerjfudd Feb 11 '22 at 09:41
  • related: https://stackoverflow.com/questions/77540422/how-do-i-remove-base64-strings-from-an-arbitrary-string-in-python – Charlie Parker Nov 24 '23 at 03:08

3 Answers

1

The mutt mail user agent (MUA) can delete messages from a mailbox by MIME type. You can even script this.

A message that has an encoded attachment can be matched in mutt with the search expression ~M application. This matches any message that contains a MIME type containing the string application, usually indicating that an attachment is encoded (probably in base64). You may obviously use the more specific application/x-msdownload if you wish.

If the mailbox is called messages.mbox, you may delete all the messages in it that have any part whose MIME type contains the string application from the command line like this:

mutt -e 'push <delete-pattern>"~M application"<enter><quit>"y"' -f messages.mbox

Note that this does not ask for any confirmation before deleting the messages from the mailbox (the "y" at the end is the reply to the question from mutt whether to delete the messages or not before quitting). You may instead want to move the messages into a separate mailbox:

mutt -e 'push <tag-pattern>"~M application"<enter><tag-prefix><save-message>bad.mbox<enter>"y"<quit>"y"' -f messages.mbox

This tags all messages that match the given search expression, saves them to the mailbox bad.mbox, and quits after deleting them from the original mailbox.
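For anyone who would rather script the same split without mutt, here is a rough sketch using Python's standard mailbox module. It is not what the answer above does: the function name split_by_mime and the output file names are made up for illustration, and the default needle "application" only mimics the ~M application pattern. It never modifies the source mailbox, so it can be tried on a copy first.

```python
import mailbox


def split_by_mime(src_path, clean_path, bad_path, needle="application"):
    """Copy every message in src_path to clean_path or bad_path.

    A message is treated as "bad" when the content type of any of its
    MIME parts contains `needle`. The source mailbox is only read.
    Returns a (clean_count, bad_count) tuple.
    """
    src = mailbox.mbox(src_path, create=False)
    clean = mailbox.mbox(clean_path)
    bad = mailbox.mbox(bad_path)
    counts = {"clean": 0, "bad": 0}
    try:
        for msg in src:
            # walk() visits the message itself and every nested MIME part.
            is_bad = any(needle in part.get_content_type() for part in msg.walk())
            (bad if is_bad else clean).add(msg)
            counts["bad" if is_bad else "clean"] += 1
    finally:
        for box in (clean, bad, src):
            box.close()  # close() flushes any pending writes
    return counts["clean"], counts["bad"]
```

Calling split_by_mime("messages.mbox", "clean.mbox", "bad.mbox", needle="application/x-msdownload") would narrow the match to the type mentioned in the question. Badly malformed messages in old archives may still need special handling that this sketch does not attempt.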

Kusalananda
  • 333,661
  • Thank you! This seems to be working. I made a temporary copy of all of the mbox files and I am now cleaning up the copies. If everything is good, I'll post the results here after they are uploaded. – elmerjfudd Feb 20 '22 at 10:12
0

Have a look at procmail, formail, and mimencode. You can easily set up complicated automated mailbox processing with those, for example to

find and replace all messages in the mbox file containing "Content-Type: application/x-msdownload".
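I don't have a tested procmail recipe to offer here, so as a stand-in, this is a small Python sketch of the "replace" idea: instead of dropping the whole message, it swaps the body of each offending MIME part for a short placeholder. The function name neutralize_parts and the placeholder text are hypothetical, and it assumes the messages come from something like Python's mailbox.mbox, which yields email.message.Message objects.

```python
from email.message import Message


def neutralize_parts(msg: Message, bad_type: str = "application/x-msdownload") -> int:
    """Replace the body of every MIME part of type `bad_type`, in place.

    Returns the number of parts that were replaced.
    """
    replaced = 0
    # Materialise the walk first, since parts are modified while looping.
    for part in list(msg.walk()):
        if part.get_content_type() != bad_type:
            continue
        # Drop the headers that described the old, encoded payload.
        for header in ("Content-Type", "Content-Transfer-Encoding", "Content-Disposition"):
            del part[header]
        part["Content-Type"] = "text/plain; charset=us-ascii"
        part.set_payload("[attachment removed]\n")
        replaced += 1
    return replaced
```

The message keeps its headers, its text, and its other attachments; only the parts with the given content type lose their payload. Whether that is enough to satisfy a particular malware scanner is something that would have to be checked against the real files.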

dirkt
  • 32,309
0

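(?:...) is Perl regex syntax (a non-capturing group) and is not part of either standard POSIX regex flavour (BRE or ERE). Separately, sed treats a script that starts with # as a comment, so the command in the question does nothing at all; to use # as the address delimiter it has to be introduced with a backslash, as in \#regex#d. The ERE equivalent of the expression (for grep -E or sed -E) should be: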

^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$

See Why does my regular expression work in X but not in Y?

The regex would also match any line that consists of nothing but an alphanumerical string whose length is a multiple of four, so something like question, congrats or any four-letter expletive or four-letter greeting alone on a line would match. Also, it doesn't allow for any whitespace at either end, and if you just remove individual lines, you might end up with messages where what remains is just meaningless.

Anyway, you could change it to something like this, which requires at least five groups of four characters and should therefore be less likely to match random words:

^([A-Za-z0-9+/]{4}){5,}([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
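
A quick way to compare the two expressions is to try them on a few sample lines. The sketch below does that with Python's re module, which accepts both patterns unchanged; the names LOOSE, STRICT and classify are mine, not anything from sed or grep. Two things it should make visible: the first expression also matches an empty line (so deleting its matches would remove every blank line, including the one between headers and body), and the stricter one no longer matches a short padded line such as the tail of an encoded block.

```python
import re

# The two expressions from the answer, copied unchanged.
LOOSE = re.compile(r"^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$")
STRICT = re.compile(r"^([A-Za-z0-9+/]{4}){5,}([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$")


def classify(line):
    """Return (matches_loose, matches_strict) for one line without its newline."""
    return bool(LOOSE.match(line)), bool(STRICT.match(line))


if __name__ == "__main__":
    samples = [
        "question",                          # ordinary 8-letter word
        "",                                  # empty line
        "Hello, world",                      # normal text with punctuation
        "TVqQAAMAAAAEAAAA//8AALgAAAAAAAAA",  # 32 characters of base64
        "TVqQAA==",                          # short padded tail line
    ]
    for sample in samples:
        print(repr(sample), classify(sample))
```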
ilkkachu
  • 138,973