1

I have a giant mbox file (8.8 GB) that contains many quoted-printable attachments, consisting of 75-character lines terminated by the "soft linebreak" sequence (equal sign, carriage return, linefeed).

I would like to search for all occurrences of a particular regular expression in the mbox content. However, any particular match may be split across multiple lines, so I need to delete each soft linebreak sequence before performing the regex search.

I'm having trouble figuring out the best way to do this. The solution for GNU sed that I found here fails because it apparently instructs sed to treat the entire file as one line. The length of that one line is greater than INT_MAX, which causes sed to exit with an error message. I also see that there is a solution here that uses ripgrep. However, it fails to actually join the lines: the equal signs are removed, but the line that ended with the equal sign remains separate from the following line.

Brian Bi
  • Maybe you can open it with mutt and search in it with l (lower case L to limit the display) and search with ~b your-regexp to search on the decoded contents? – Stéphane Chazelas Apr 26 '23 at 16:23
  • What exactly does "search for all occurrences of a particular regular expression" mean? Display the lines that match the expression? Please add a piece of example input, the regular expression and the expected result. – Bodo Apr 26 '23 at 16:23
  • Try preprocessing with perl -pe 's/=\r?\n//' (<mailbox perl -pe 's/=\r?\n//' | grep regex) – Stéphane Chazelas Apr 26 '23 at 16:33
  • google the Unix tools formail and procmail - one of them can probably help you. – Ed Morton Apr 26 '23 at 21:33
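
The line-by-line preprocessing idea from the comments above (strip each soft linebreak, then search the joined text) can also be sketched in Python as a streaming filter, so that memory use is bounded by the longest joined line rather than by the 8.8 GB file. This is only a sketch under assumptions: the function names joined_lines and find_matches are made up for illustration, and the code blindly treats every line ending in "=" as a soft linebreak, so it does not know which parts are really quoted-printable and leaves escapes such as =E9 undecoded.

```python
import io
import re
from typing import BinaryIO, Iterator, Pattern


def joined_lines(stream: BinaryIO) -> Iterator[bytes]:
    """Yield lines with soft linebreaks ("=" + CRLF or "=" + LF) removed."""
    pending = []  # fragments of the logical line being assembled
    for raw in stream:
        if raw.endswith(b"=\r\n"):
            pending.append(raw[:-3])  # drop "=\r\n" and keep reading
        elif raw.endswith(b"=\n"):
            pending.append(raw[:-2])  # drop "=\n" and keep reading
        else:
            pending.append(raw)
            yield b"".join(pending)
            pending = []
    if pending:  # the input ended on a soft linebreak
        yield b"".join(pending)


def find_matches(stream: BinaryIO, pattern: Pattern[bytes]) -> Iterator[bytes]:
    """Yield every match of a bytes pattern in the joined lines."""
    for line in joined_lines(stream):
        for match in pattern.finditer(line):
            yield match.group(0)


# Small in-memory demo; for a real mailbox, pass open(path, "rb") instead.
sample = io.BytesIO(b"see http://exa=\r\nmple.com/x for\r\nmore\r\n")
print(list(find_matches(sample, re.compile(rb"http://example\.com/\w+"))))
# prints [b'http://example.com/x']
```

Reading in binary mode avoids any guess about the character encoding of the mailbox, which is why the pattern is a bytes pattern.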

2 Answers

2

In the end, what you're describing amounts to parsing the mbox into the contents of its emails (among which I count their attachments) and then searching through those. Proper approach!

So, do that: actually parse the mbox file as mails, and you'll be somewhat happy. Here's my off-the-top-of-my-head, cobbled-together code, untested but not flagged red in my editor:

#!/usr/bin/env python3
import mailbox
import re as regex

pattern = regex.compile("BRIANSREGEX")

mb = mailbox.mbox("Briansmails.mbox", create=False)
for msg in mb:
    print(f"{msg['Subject']}")
    for part in msg.walk():
        print(f"| {part.get_content_type()}, {part.get_content_charset()}")
        print("-" * 120)
        payload = part.get_payload(decode=True)
        # note: the method is not called here; see the comments below
        charset = part.get_content_charset
        if type(payload) is bytes:
            content = ""
            try:
                content = payload.decode(charset)
            except:
                # print("failed to decode")
                try:
                    content = payload.decode()  # fall back to the default codec (UTF-8)
                except UnicodeDecodeError:
                    pass  # leave content empty if that fails too
            match = pattern.search(content)
            # do what you want with that match…
            if match:
                print(f"| matched at {match.start()}")
        print("-" * 120)

  • There are some errors here (get_content_charset is missing the () needed to call it, and we must take into account that it might return None), but this is a good approach overall. I realized that it also matches inside base64 attachments, not just quoted-printable ones. – Brian Bi Apr 27 '23 at 01:54
  • In addition, if part.is_multipart() is true, then attempting to get the payload should just be skipped (it will return None anyway). – Brian Bi Apr 27 '23 at 13:48
  • @BrianBi as said, that was just quickly written down without any testing or sense. – Marcus Müller Apr 27 '23 at 21:03
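
The fixes raised in these comments (call get_content_charset(), cope with it returning None, and skip multipart containers) can be sketched as one small helper. This is a hedged illustration rather than the answerer's code: the name decoded_text and the choice to fall back to UTF-8 with replacement characters are assumptions made here.

```python
from email.message import Message
from typing import Optional


def decoded_text(part: Message) -> Optional[str]:
    """Return the decoded text of a leaf MIME part, or None if it has none."""
    if part.is_multipart():
        return None  # containers have no payload of their own
    # decode=True undoes the transfer encoding (quoted-printable, base64)
    payload = part.get_payload(decode=True)
    if not isinstance(payload, bytes):
        return None
    # get_content_charset() returns None when no charset is declared
    charset = part.get_content_charset() or "utf-8"  # assumed default
    try:
        return payload.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # unknown charset name, or bytes that are invalid in that charset
        return payload.decode("utf-8", errors="replace")
```

Inside a loop over msg.walk(), the result can be handed straight to pattern.search() whenever it is not None.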
0

I've given Marcus Müller credit for his answer, but here's the version that I ultimately ended up using, which is based on his answer:

#!/usr/bin/env python3
import mailbox
import re
import sys

byte_pattern = re.compile(b"https?://[^/]*imgur.com/[a-zA-Z0-9/.]*")
str_pattern = re.compile("https?://[^/]*imgur.com/[a-zA-Z0-9/.]*")

mb = mailbox.mbox(sys.argv[1], create=False)
for msg in mb:
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True)
        if type(payload) is bytes:
            # first, search it as a binary string
            for match in byte_pattern.findall(payload):
                print(match.decode('ascii'))
            # then, try to decode it in case it's utf-16 or something weird
            charset = part.get_content_charset()
            if charset and charset != 'utf-8':
                try:
                    content = payload.decode(charset)
                    for match in str_pattern.findall(content):
                        print(match)
                except (LookupError, UnicodeDecodeError):
                    # unknown charset name or undecodable bytes: skip this part
                    pass
        else:
            print('failed to get message part as bytes')

Note that part may be a multipart message, i.e., not a leaf node in the tree, because walk performs a depth-first traversal that includes non-leaf nodes. Such parts are skipped.

For leaf nodes only, I first search for my pattern in the payload as a byte string, then, if the indicated charset is not UTF-8, attempt to decode the payload as text using that charset and search again. (If it is UTF-8, which is most common, any match will already have been found in the byte string.)
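
To illustrate why the second, decoded pass is needed, here is a minimal sketch with a made-up URL and a simplified pattern (both are assumptions for the demo, not the pattern used above). An ASCII byte pattern finds the URL in a UTF-8 payload but not in a UTF-16 one, where every character is followed by a zero byte.

```python
import re

# Simplified stand-ins for the byte and str patterns (hypothetical)
byte_pattern = re.compile(rb"https?://[^/\s]+/\w+")
str_pattern = re.compile(r"https?://[^/\s]+/\w+")

text = "see https://example.com/abc"
utf8_payload = text.encode("utf-8")
utf16_payload = text.encode("utf-16")

# The byte search works on UTF-8 because ASCII characters keep their byte values
utf8_hits = byte_pattern.findall(utf8_payload)

# In UTF-16 the letters are interleaved with zero bytes, so nothing matches
utf16_raw_hits = byte_pattern.findall(utf16_payload)

# Decoding with the declared charset makes the URL searchable again
utf16_decoded_hits = str_pattern.findall(utf16_payload.decode("utf-16"))

print(utf8_hits, utf16_raw_hits, utf16_decoded_hits)
```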

Brian Bi