I have a giant mbox file (8.8 GB) that contains many quoted-printable attachments, consisting of 75-character lines terminated by the "soft linebreak" sequence (equal sign, carriage return, linefeed).
I would like to search for all occurrences of a particular regular expression in the mbox content. However, any particular match may be split across multiple lines, so I need to delete each soft linebreak sequence before performing the regex search.
I'm having trouble figuring out the best way to do this. The solution for GNU sed that I found here fails because it apparently instructs sed to treat the entire file as one line. The length of that one line is greater than INT_MAX, which causes sed to exit with an error message. I also see that there is a solution here that uses ripgrep. However, this fails to actually join the lines; the equal signs are removed but the line with the equal sign remains separate from the following line.
mutt and search in it with l (lower case L, to limit the display) and search with ~b your-regexp to search on the decoded contents? – Stéphane Chazelas Apr 26 '23 at 16:23
perl -pe 's/=\r?\n//' (<mailbox perl -pe 's/=\r?\n//' | grep regex) – Stéphane Chazelas Apr 26 '23 at 16:33
formail and procmail - one of them can probably help you. – Ed Morton Apr 26 '23 at 21:33
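The perl one-liner in the comments avoids the INT_MAX problem because it processes one physical line at a time instead of loading the whole file. For anyone who prefers a script, here is a minimal Python sketch of the same streaming idea. It is only an illustration under some assumptions: the function names join_soft_breaks and search are made up for this example, every line ending in "=" followed by a newline is treated as a soft linebreak (so the last line of a base64 part that ends in "=" padding gets joined to the line after it), and matches are reported in their still-encoded form (escapes such as =3D are not decoded).

```python
import re
import sys
from typing import Iterable, Iterator

# Soft linebreak endings: "=" + CRLF, or "=" + LF if the file was converted.
SOFT_BREAKS = (b"=\r\n", b"=\n")


def join_soft_breaks(lines: Iterable[bytes]) -> Iterator[bytes]:
    """Yield logical lines with quoted-printable soft linebreaks removed.

    Only the pieces of the current logical line are held in memory,
    so the size of the input file does not matter.
    """
    pending = []
    for line in lines:
        for ending in SOFT_BREAKS:
            if line.endswith(ending):
                # Drop the "=" and the newline, keep accumulating.
                pending.append(line[: -len(ending)])
                break
        else:
            # Ordinary line: it completes the current logical line.
            pending.append(line)
            yield b"".join(pending)
            pending = []
    if pending:
        # Input ended right after a soft linebreak.
        yield b"".join(pending)


def search(lines: Iterable[bytes], pattern: bytes) -> Iterator[bytes]:
    """Yield each joined logical line that matches the regex `pattern`."""
    regex = re.compile(pattern)
    for logical in join_soft_breaks(lines):
        if regex.search(logical):
            yield logical


def main(pattern: str, path: str) -> None:
    # Binary mode: iteration splits on b"\n" only and never decodes.
    with open(path, "rb") as mbox:
        for hit in search(mbox, pattern.encode()):
            sys.stdout.buffer.write(hit)


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Saved under a name such as qpgrep.py (a hypothetical filename), it would be run as python3 qpgrep.py 'your-regexp' mailbox. If only the joining step is needed, the output of join_soft_breaks can be written to stdout and piped into grep instead.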