Can't replace my regex matches

Question

I can filter files, and I can stream the matches of my regex ... However, I need to remove exactly those matches from a large file.

Regex:^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$

sed -e '/^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$/d/ /g' file

only streams the matches but does not replace/cut them.

I can search for files containing matches; that also works.

What is the formula to get it working?

I did split the large file, delete the partial files, which did contain my regex and assemble them reverse to the whole. However, that is more likely a slower / more dirty process over endless partial files — Olaf, Mar 10 '19 at 06:41
@Olaf Just to be clear, you want to remove the patterns matching your regex from the file. Is that correct? — Haxiel, Mar 10 '19 at 06:45
YES. In fact, that are emails, where attachments get removed, with the aging of the mails, as a huge number of clients abuses the mail-servers as storage. After 8 years and longer, several with attachments forwarded, no 35MB email with the attachments is useful. I only want to preserve the textual content — Olaf, Mar 10 '19 at 06:52

Kusalananda · Answer 1 · 2019-03-10T08:48:38.893

1

It appears that you are using a Perl-compatible Regular Expression (PCRE) with sed. The sed utility only knows Basic Regular Expressions (BRE) by default (or Extended Regular Expressions (ERE) when used with -E on most systems).
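As a rough illustration of the difference between the flavours, the sketch below runs the same interval pattern through sed in BRE and ERE mode. The sample line and the variable names are made up for this example. Neither flavour has the PCRE-only `(?:...)` non-capturing group.

```shell
# One sample line; the variable names below are only for this sketch.
line='abcd'

# BRE (sed default): an unescaped {4} is literal text, so nothing matches.
bre_plain=$(printf '%s\n' "$line" | sed -n '/^[a-z]{4}$/p')

# BRE with the interval escaped as \{4\}: the line matches.
bre_escaped=$(printf '%s\n' "$line" | sed -n '/^[a-z]\{4\}$/p')

# ERE (sed -E): {4} is an interval without any backslashes.
ere=$(printf '%s\n' "$line" | sed -E -n '/^[a-z]{4}$/p')

printf 'BRE plain: [%s]\nBRE escaped: [%s]\nERE: [%s]\n' "$bre_plain" "$bre_escaped" "$ere"
```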

I also don't think that the sed syntax is correct, but it's difficult to read because the expression in the question seems to have extra * in them. You appear to want to strip out the multipart divider in an email message, but you don't seem to care about matching these up correctly (matching a start of one multipart part to the corresponding end divider). If the sed syntax was corrected, the expression would likely delete the full contents of the emails, or combine all attachments into the body of the message.

The PCRE expression

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$

is the same as the ERE (to be used with sed -E)

^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$

and using this with d (which you appear to be doing) would delete those lines, but the trailing / /g in your sed command is an error. Removing / /g would likely have the effect of combining all attachments into the body of the email.
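A minimal sketch of the ERE form used with d, run on a made-up five-line sample (the sample text and the variable names are assumptions for illustration only). Note that because both groups are optional, the pattern also matches empty lines, and it matches any line made up only of base64 characters whose length is a multiple of four, so ordinary text can be deleted too.

```shell
# Hypothetical sample: a text line, a blank line, two base64-looking lines, a text line.
sample=$(mktemp)
printf '%s\n' 'Hello Olaf,' '' 'eyGf7wbBqo6apbkvAzbR1Vjm' 'aGVsbG8=' 'Regards' > "$sample"

# \#...# uses '#' as the address delimiter, so the '/' inside the
# bracket expressions does not need escaping.
result=$(sed -E '\#^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$#d' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

On this sample only the two text lines survive; the blank line is removed along with the base64 lines.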

If you want to strip attachments of email messages (as indicated in comments), I would not try to do it with sed but with a proper email message parser.

Examples of how to go about doing this may be found in the following related questions:

Personally, I would write a Perl script similar to the one in the first linked question/answer above. Just remember to always do test runs of such scripts on copies of your mailboxes, in case you make mistakes.

The fdm mail tool is able to filter messages based on the number and/or size of attachments, which may be handy as a way of filtering out large email messages from archived mailboxes.

edited Mar 10 '19 at 08:48

answered Mar 10 '19 at 08:17

Kusalananda

333,661

"I also don't think that the sed syntax is correct, but it's difficult to read because the expression in the question seems to have extra * in them." - yes, this regex matches in gedit AND bluefish, as well other apps exactly the parts, I want/need remove. (eyGf7wbBqo6apbkvAzbR1Vjmt+W5hhQl5F6dSawrzVJrpvKs49ynhnzis5xS6u500JTlpZW89isn iC/kzCI13njNSQXrQNvltXaQ9WPJp6aXNJCqO6RAHORyfzqKXTZIJN1xK0sIHVTyPwrO01q2zovR ......) Until now, I just split the files on ss1='------=NextPart' ss2='------=Part' ss3='----' ss4='--=' ss5='--' – Olaf Mar 10 '19 at 08:25
@Olaf Yes, it may well do if those tools understand PCRE, but sed does not. – Kusalananda Mar 10 '19 at 08:27
to split the huge files. Obviously, when splitting here and delete the file parts containing the regex, there 5 lines missing. I neither manage to break before or after the regex - empty lines, neither I get the base64 codes (see the regex out) – Olaf Mar 10 '19 at 08:30
@Olaf As I mentioned in my answer, I would not do this with a tool that is not email-aware. – Kusalananda Mar 10 '19 at 08:31
"The fdm mail tool is able to filter messages" - That I can do via find, sort desc. A plain text mail of 1 mb is not between, 2100k small - 800k is huge for a mail, without attachment. I only look currently at files exceeding 30MB, containing 29 attached mails containing attachments – Olaf Mar 10 '19 at 08:34
No, they are running exim - open source (WHM) - so they expect, someone will mess around in structure tree' .. find -P /home///flat-dir $one_year | while read OLDMAIL; do echo "${OLDMAIL}" pwd if grep -q 'ds1' "${OLDMAIL}"; then echo "yes" ..." <- here looping over the files/mails, looking, where/how to split. As I can del the base64, I have to see, one of the many options, where the file/mail splits – Olaf Mar 10 '19 at 08:36

score 0 · Answer 2 · answered Mar 10 '19 at 18:18

0

try:

sed -E "s/^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$//g" file

and double-check the output. The -E has to be a capital letter; lowercase -e does not work.

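Once you are sure it works, add -i as a separate option (for example sed -i -E ...) to make the changes directly in the file. Written together as -iE, the E would be read as a backup suffix for -i rather than as the -E option.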
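A cautious sketch of the in-place step, assuming the pattern is written in ERE form (plain groups instead of the PCRE-only `(?:...)`) and keeping a backup through a suffix on -i. The file contents and the variable names are made up for this example.

```shell
# Hypothetical three-line mail body in a temporary file.
mail=$(mktemp)
printf '%s\n' 'Body text.' 'aGVsbG8=' 'Bye.' > "$mail"

# -i.bak edits the file in place and keeps the original as "$mail.bak".
# '#' is the s command delimiter, so '/' in the brackets needs no escaping.
sed -E -i.bak 's#^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$##' "$mail"

edited=$(cat "$mail")      # the base64 line is now an empty line
backup=$(cat "$mail.bak")  # untouched copy of the original
printf '%s\n' "$edited"
rm -f "$mail" "$mail.bak"
```

Unlike d, the s command only empties the matching lines, so blank lines remain where the base64 text was.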


Juan

251

I will try that in a while and give you a thumb up/down. Thx a lot – Olaf Mar 11 '19 at 08:54
sed: -e expression #1, char 69: Invalid preceding regular expression – Olaf Mar 11 '19 at 09:14
