I am trying to upload around 20 years' worth of Usenet archives to archive.org, but my first batch was rejected because some of the archives contained trojans encoded in base64. Since I have around 400GB of files to work with, fixing things manually is out of the question. All of the files are in mbox format, which is plain text. My first thought was to find and replace all messages in the mbox file containing "Content-Type: application/x-msdownload". That could be pretty difficult. I'm now thinking that an easier brute-force method would be to delete all base64 blocks.
From this question, I see that it's possible to find base64 blocks with grep, but I don't know how to set up the same thing with sed and that's why I'm asking. Thanks!
Edit: what I've tried so far
According to this page, ^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
should be the regex needed to find base64 text, but when I try to use that with sed, it doesn't actually work or at least it doesn't do what I expect.
Example:
cat clari.local.california.sfbay.biz.mbox | sed -e '#^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$#d' > clari.local.california.sfbay.biz.mbox.test
clari.local.california.sfbay.biz.mbox.test still contains the base64 text.
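For reference, a likely reason the command above changes nothing: in sed, a script that starts with `#` is a comment, so the whole expression is ignored. An address with a delimiter other than `/` has to start with a backslash (`\#...#`). In addition, `(?:...)` is Perl-style syntax that sed does not understand, and the `{4}` and `|` operators need extended regular expressions (`-E`). Below is a minimal sketch of a sed-compatible variant, assuming a sed that supports `-E` (GNU and BSD sed do). The function name `strip_base64_lines` is made up for illustration, and the pattern is slightly different from the one quoted above: it requires at least one complete group of four characters so that empty lines are not deleted.

```shell
# Hypothetical helper: delete every line that looks like pure base64.
# Assumptions: sed supports -E; each base64 line is a multiple of 4
# characters long, optionally ending in = or == padding.
strip_base64_lines() {
    # \#...# uses # as the regex delimiter, so the / inside the
    # bracket expressions does not need escaping.
    sed -E '\#^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)$#d' "$@"
}

# Usage (file names taken from the example above):
#   strip_base64_lines clari.local.california.sfbay.biz.mbox > clari.local.california.sfbay.biz.mbox.test
```

Note that this is still a blunt instrument: any line made up only of letters, digits, `+` and `/` whose length is a multiple of four (a line consisting of just the word "test", say) is deleted as well, as is every legitimate base64 attachment.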
mutt should be able to do that. Since you're working on mbox files, just deleting any base64-encoded block would likely also delete many encoded messages, images, and other data. – Kusalananda Feb 11 '22 at 07:57
Does your regular expression expect that the base64 content is on a single line?
I don't know. I'm fairly unfamiliar with regex. Here's an example of a message containing malware. If I remove just the base64-encoded material, then it passes the malware checker. If it is left in, it fails. https://pastebin.com/bFyEevqC – elmerjfudd Feb 11 '22 at 09:29
Install.exe has been removed and there are two encoded GIF images. This means there is no actual offending content to remove. – Kusalananda Feb 11 '22 at 09:32
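Since deleting every base64 block would also remove legitimate images and encoded messages, a narrower approach is to drop only the bodies of MIME parts declared as application/x-msdownload, which was the original idea in the question. The following is a rough sketch with awk, not a full MIME parser. It assumes that each Content-Type header sits on a single line (not folded), that part bodies end at a line starting with `--` (a MIME boundary) or `From ` (the next mbox message), and the function name `strip_msdownload_parts` is made up for illustration.

```shell
# Hypothetical helper: empty the body of every MIME part whose headers
# declare Content-Type: application/x-msdownload. Headers, boundaries
# and all other parts are passed through unchanged.
strip_msdownload_parts() {
    awk '
        # A new mbox message or a MIME boundary starts a fresh header block.
        /^From / || /^--/ { in_headers = 1; bad = 0; skipping = 0 }

        # Inside the body of a flagged part: drop the line.
        skipping { next }

        # Flag the part if its headers name the unwanted type.
        in_headers && tolower($0) ~ /^content-type:[ \t]*application\/x-msdownload/ { bad = 1 }

        # A blank line ends the headers; start skipping if the part was flagged.
        in_headers && /^\r?$/ {
            in_headers = 0
            print
            if (bad) skipping = 1
            next
        }

        { print }
    ' "$@"
}

# Usage (file names are placeholders):
#   strip_msdownload_parts input.mbox > output.mbox
```

Body text that happens to start with `--` (a signature separator, for instance) is treated as a boundary here, so the result should be checked on a sample before running it over the full archive.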