0

I have files that I am trying to standardize, without success, because I cannot seem to find a pattern that matches with sed. In Notepad++ I can clearly see the CRLF at the end of each line.

[screenshot of the sample text in Notepad++]

When non-printable characters are displayed, I get a ^M with cat, and a ^M or \r at the end of the line.


In Notepad++ I can search for \r\n\h+ and delete the carriage return along with all the whitespace to concatenate the CC: onto one line (sometimes there can be a few line breaks).

I think I have tried every combination with sed, without success. I also reviewed this link: https://stackoverflow.com/questions/3569997/how-to-find-out-line-endings-in-a-text-file

What am I missing?

Tried and failed examples

sed -En 's/\r\s+//g' $NewFile
sed -En 's/\r +//g' $NewFile
sed -En 's/\r\n +//g' $NewFile
sed -En 's/\n +//g' $NewFile
αғsнιη
  • 41,407
user68650
  • 343

3 Answers

4

The simple fix is to use a regular expression which searches for \r at the end of the line (sed strips the newline when it reads each line, so the \r ends up as the last character in the pattern space and can be matched with the regex anchor $):

sed 's/\r$//' file.dos >file.txt

If your sed doesn't support the \r escape, try using a Bash "C-style" string ($'...'), which embeds a literal carriage return character in your sed script:

sed $'s/\r$//' file.dos >file.txt
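
If you are not using Bash, a similar effect can be had by splicing a literal carriage return into the script with command substitution (a rough sketch, assuming a POSIX shell with printf):

cr=$(printf '\r')                  # capture a literal carriage return
sed "s/${cr}\$//" file.dos >file.txt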

... or use Awk, where the symbolic representation is standard:

awk '{ sub(/\r$/, "") } 1' file.dos >file.txt

... or use the dos2unix tool, which has this logic built in:

dos2unix file.dos
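
By default dos2unix converts the file in place; if you would rather keep the original, reasonably recent versions have a new-file mode (check your dos2unix man page for -n / --newfile):

dos2unix -n file.dos file.txt    # keep file.dos as-is, write the converted copy to file.txt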

See also https://stackoverflow.com/questions/39527571/are-shell-scripts-sensitive-to-encoding-and-line-endings

After this preprocessing, you can use a standard email tool for extracting headers. Procmail offers formail -c (and maybe add -z too) to concatenate the folded header lines back onto single lines, or you can use a simple Awk one-liner; both are sketched below.
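
A minimal formail sketch, assuming the message is saved as message.eml and procmail's formail is installed (the flags are from memory, so double-check the man page):

# -c concatenates (unfolds) continued header fields;
# adding -z also normalizes the whitespace after the colon
formail -c <message.eml >unfolded.eml

And the Awk one-liner: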

awk '!body { if(NR > 1 && $0 ~ /^[^ \n]/) printf "\n"; printf "%s", $0 }
  /^$/ { printf "\n"; body=1 }
  body'

If you want to, you can of course add the sub action from the earlier Awk solution to the top of this script.
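
Roughly like this (a sketch combining the two so the raw DOS file can be fed in directly; verify it on your own data):

awk '{ sub(/\r$/, "") }   # strip the trailing CR first, as in the earlier one-liner
  !body { if(NR > 1 && $0 ~ /^[^ \n]/) printf "\n"; printf "%s", $0 }
  /^$/ { printf "\n"; body=1 }
  body' file.dos >file.txt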

Demo: https://ideone.com/1z8uOU

tripleee
  • 7,699
  • Why limit to the end of the line? You might as well just do sed 's/\r//g' file. – terdon Jan 02 '23 at 10:54
  • Or then use tr -d '\015' <file but that's not really what the OP was asking. – tripleee Jan 02 '23 at 10:54
  • @terdon sed 's/\r//g' file would delete ALL \rs in the file, not just the ones at the end of a line. There's nothing special about a \r in the middle of a line and no obvious reason why it should be deleted. – Ed Morton Jan 03 '23 at 00:18
  • @EdMorton yes, exactly. Which is what one wants in all but extremely rare edge cases. This is a classic issue which 9 times out of 10 is caused by opening the file in Windows. – terdon Jan 03 '23 at 09:53
0

sed only looks at one line at a time by default, and doesn't even include the newline in the buffer, so you can't do stuff like s/\r\n/.../, let alone s/\r\n +// the simple way. You'll have to arrange for the whole file to be processed in one go.
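
You can see this with a quick experiment (a throwaway illustration; the sample line is made up):

printf 'to: someone \r\n  someone else\r\n' | sed 's/\r\n +//g' | cat -A
# both lines come out unchanged, still ending in ^M$, because the \n
# is never in sed's pattern space for the regex to match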

I'm not sure if that can be done easily in sed, but at least with GNU sed, you can use the -z option to have it use the NUL byte as a separator instead of the newline. Text files shouldn't have NULs, so in practice that has the effect of reading the whole file in.

E.g. with this input file:

$ cat -A foo.txt
from: foo bar^M$
to: someone ^M$
  someone else^M$
cc: something ^M$
   something else^M$
^M$

Something like this might work:

$ sed -z -Ee 's/\r\n +//g' foo.txt |cat -A
from: foo bar^M$
to: someone someone else^M$
cc: something something else^M$
^M$

Or you could use Perl, which can be told to read the whole file in, with the -0777 option.

$ perl -0777 -pe 's/\r\n +//g' foo.txt |cat -A
from: foo bar^M$
to: someone someone else^M$
cc: something something else^M$
^M$

I'm not exactly sure what the processing rules for email headers are, though, so I won't comment on that. But note that there are existing tools/libraries/modules for processing emails. Also, what you're doing here would mangle the message data too, if you have that in the same file.

ilkkachu
  • 138,973
  • Email folding indeed works like this; a line which starts with a space or a tab is semantically joined with the previous line, up until the first empty line, which is the end of the headers. A complication is MIME multipart messages, in which each body part also has a set of headers. – tripleee Jan 02 '23 at 15:51
  • ... But the sample doesn't look like proper RFC5322 headers; those would have Date: instead of Sent: and a different date format, and <brokets> instead of [square brackets] around the email addresses. The OP's sample looks like some sort of Microsoft ad-hockery. – tripleee Jan 02 '23 at 15:55
  • @tripleee, yep. I was thinking about the details, like if it removed just one space at the start of the line or all of them, etc. I could have checked, but I didn't think it was that important. – ilkkachu Jan 02 '23 at 16:12
  • Unquoted whitespace is insignificant, so you can trim to just one space if you like. A proper solution would also need to decode any RFC2047 encoding, but at that point, probably switch to a language with a proper email library, like Python. Or, like we used to say in the trade, Subject: =?UTF-8?B? dHJ5IFB5dGhvbg==?= – tripleee Jan 02 '23 at 16:25
0

First, thank you to everyone who responded and guided me in the right direction; I would not have been able to accomplish this, let alone understand how sed processes the file. I did try the -z option and the dos2unix utility as suggested by @seshoumara, but for some reason I could not seem to get them working. @ilkkachu's answer was informative as well.

So, after a long, thorough search, I came up with the following. The first line extracts the email header until a blank line is found, and the second line normalizes it.

sed -n '0,/^\r/p' $f | tee bk/$NewFile  
sed -Ei ':Loop ; $!N ; s/\n\s+/ / ; tLoop ; P ; D' bk/$NewFile
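
For anyone reading later, here is the second command spread out one step per line with comments (an equivalent GNU sed sketch of the same in-place edit):

sed -Ei '
:Loop
# if this is not the last line, append the next line to the pattern space
$!N
# join it onto the current header if it starts with whitespace (a folded line)
s/\n\s+/ /
# a successful join means the following line may also be folded, so loop
tLoop
# otherwise print up to the first embedded newline ...
P
# ... delete it, and restart the cycle with the remainder
D
' bk/$NewFile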

Line endings were the biggest issue, along with knowing when to use \r, \n, or \r\n, and using dos2unix fixed it. I am unsure why I was able to use \r with the first command and not the second (I will save that for later).

user68650
  • 343